MULTI-USER REINFORCEMENT LEARNING WITH LOW RANK REWARDS

Abstract

We consider the problem of collaborative multi-user reinforcement learning. In this setting there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the N users has a low-rank structure (a standard and practically successful assumption in the offline collaborative filtering setting), we design algorithms with significantly lower sample complexity than algorithms that learn the MDP individually for each user. Our main contribution is an algorithm which explores rewards collaboratively across the N user-specific MDPs and can learn the rewards efficiently in two key settings: tabular MDPs and linear MDPs. When N is large and the rank is constant, the sample complexity per MDP depends only logarithmically on the size of the state space, which represents an exponential reduction (in the state-space size) when compared to standard "non-collaborative" algorithms.

1. INTRODUCTION

Reinforcement learning has recently seen tremendous empirical and theoretical success Mnih et al. (2015); Sutton et al. (1992); Jin et al. (2020b); Gheshlaghi Azar et al. (2013); Dann & Brunskill (2015). Near-optimal algorithms have been proposed to explore and learn a given MDP with sample access to trajectories. In this work, we consider the problem of learning the optimal policies for multiple MDPs collaboratively, so that the total number of trajectories sampled per MDP is smaller than the number required to learn them individually. This combines reinforcement learning and collaborative filtering. We assume that the various users have the same transition matrices but different rewards, and that the rewards have a low-rank structure. The low-rank assumption is popular in the collaborative filtering literature and has been deployed successfully in a variety of tasks Bell & Koren (2007); Gleich & Lim (2011); Hsieh et al. (2012). This can be regarded as an instance of multi-task reinforcement learning, various versions of which have been considered in the literature Brunskill & Li (2013); D'Eramo et al. (2020); Teh et al. (2017); Hessel et al. (2019). Motivation: Recently, collaborative filtering has been studied in the online learning setting Bresler & Karzand (2021); Jain & Pal (2022), where multiple bandit instances are simultaneously explored under low-rank assumptions. In this work, we extend this setting to stateful modeling of such systems. In the context of e-commerce, this allows the algorithm to discover temporal patterns like 'User bought a phone and hence might eventually be interested in a phone cover' or 'User last bought shoes many years ago, which might be worn out by now, so recommend shoes'. Note that the fact that a user has bought shoes changes their preferences (and hence the reward function); our setting allows one to model such changes.
While we assume that the users share the same transition matrix, this can be relaxed in practice by clustering users based on side information and modeling each cluster as having a common transition matrix. This approach has been successfully deployed in various multi-agent RL problems in practice, including in sensitive healthcare settings (see Mate et al. (2022) and references therein). Our Contributions: a) Improved Sample Complexity: We introduce the setting of multi-user collaborative reinforcement learning for tabular and linear MDPs and provide sample-efficient algorithms for both scenarios without access to a generative model. Under the low-rank assumption, the total sample complexity required to learn near-optimal policies for every user scales as Õ(N + |S||A|) instead of O(N|S||A|) in the case of tabular MDPs, and Õ(N + d) instead of O(Nd²) in the case of linear MDPs. b) Collaborative Exploration: In order to learn the rewards of all the users efficiently under low-rank assumptions, we need to deploy standard low-rank matrix estimation/completion algorithms, which require specific kinds of linear measurements (see Section 1.1). Without access to a generative model, the main challenge in this setting is to obtain these linear measurements by querying trajectories of carefully designed policies. We design such algorithms in Section 3. c) Functional Reward Maximization: In the case of linear MDPs, matrix completion is more challenging since we observe measurements of the form e_i^⊤ Θ* ψ, where Θ* ∈ R^{N×d}, corresponding to the reward obtained by user i with respect to an embedding ψ. Estimating Θ* under low-rank assumptions requires the distribution of ψ to have certain isotropy properties (see Section 6). Querying such measurements goes beyond the usual reward maximization and is related to mean-field limits of multi-agent reinforcement learning, similar to the setting in Cammardella et al.
(2020), where a functional of the distribution of the states is optimized. We design a procedure which can sample-efficiently estimate policies which lead to these isotropic measurements (Section 5). d) Matrix Completion With Row-Wise Linear Measurements: For the linear MDP setting, the low-rank matrix estimation problem lies somewhere between matrix completion (Recht, 2011; Jain et al., 2013) and matrix estimation with restricted strong convexity (Negahban et al., 2009). We give a novel active-learning-based algorithm where we estimate Θ* row by row without any assumptions like incoherence. This is described in Section 6.

1.1. RELATED WORKS

Related Settings: Multi-task reinforcement learning has been studied empirically and theoretically Brunskill & Li (2013); Taylor & Stone (2009); D'Eramo et al. (2020); Teh et al. (2017); Hessel et al. (2019); Sodhani et al. (2021). Modi et al. (2017) consider learning a sequence of MDPs with side information, where the parameters of the MDPs vary smoothly with the context. Shah et al. (2020) introduced the setting where the optimal Q function Q*(s, a), when represented as an |S| × |A| matrix, has low rank. With a generative model, they obtain algorithms which make use of this structure to achieve smaller sample complexity whenever the discount factor is bounded by a constant. Sam et al. (2022) improve the results in this setting by considering additional assumptions like low-rank transition kernels. Our setting is different in that we consider multiple users but do not assume access to a generative model; in fact, our main contribution is to efficiently obtain measurements conducive to matrix completion without a generative model. Hu et al. (2021) consider a multitask RL problem with linear function approximation similar to our setting, but with the assumption of low-rank Bellman closure, where the application of the Bellman operator retains the low-rank structure. They obtain a bound depending on the quantity N√d instead of (N + d) as in our work. Lei & Li (2019) consider multi-user RL with low-rank assumptions in an experimental context. Low Rank Matrix Estimation: Low-rank matrix estimation has been extensively studied in the statistics and ML communities for decades, both in the context of supervised learning and in multi-user collaborative filtering settings Candès & Tao (2010); Negahban & Wainwright (2011); Fazel (2002); Chen et al. (2019); Jain et al. (2013; 2017); Recht (2011); Chen et al. (2020); Chi et al. (2019).
The basic question is to estimate a d_1 × d_2 matrix M given linear measurements (x_i^⊤ M y_i)_{i=1}^n, when the number of samples n is much smaller than d_1 d_2, using the assumption that M has low rank. a) Matrix Completion: x_i and y_i are standard basis vectors. Typically x_i and y_i are picked uniformly at random, and recovery guarantees are given whenever the matrix M is incoherent (Recht, 2011). b) Matrix Estimation: x_i and y_i are not restricted to be standard basis vectors. Typically, they are chosen i.i.d. such that restricted strong convexity holds (Negahban et al., 2009). In this work, we consider MDPs associated with N users such that their reward matrix satisfies a low-rank structure. For the case of tabular MDPs, we use the matrix completion setting; for the case of linear MDPs, our setting lies somewhere between settings a) and b), as explained above.
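As a concrete instance of setting a), the following is a minimal alternating least squares completion sketch in the spirit of Jain et al. (2013); the dimensions, ridge term and iteration budget are our illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def als_complete(M_obs, mask, r, iters=100, lam=1e-6):
    """Complete a low-rank matrix from the entries observed under `mask`
    by alternating ridge least squares on the factorization M ~ U @ V.T."""
    n1, n2 = M_obs.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n1, r))
    V = rng.standard_normal((n2, r))
    for _ in range(iters):
        for i in range(n1):  # refit row factor U[i] on observed columns
            A = V[mask[i]]
            U[i] = np.linalg.solve(A.T @ A + lam * np.eye(r), A.T @ M_obs[i, mask[i]])
        for j in range(n2):  # refit column factor V[j] on observed rows
            B = U[mask[:, j]]
            V[j] = np.linalg.solve(B.T @ B + lam * np.eye(r), B.T @ M_obs[mask[:, j], j])
    return U @ V.T

rng = np.random.default_rng(1)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))  # rank-2 truth
mask = rng.random(M.shape) < 0.6            # observe ~60% of the entries
M_hat = als_complete(np.where(mask, M, 0.0), mask, r=2)
```

With roughly 360 observed entries against the 100 free parameters of a rank-2 factorization, the unobserved entries are typically recovered accurately in this noiseless setup.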

1.2. NOTATION

By ∥·∥ we denote the Euclidean norm and by e_1, e_2, ... the standard basis vectors of the space R^m for some m ∈ N. Let S^{d−1} := {x ∈ R^d : ∥x∥ = 1} and B_d(r) := {x ∈ R^d : ∥x∥ ≤ r}. For any m × n matrix A and a set Ω ⊆ [n], by A^Ω we denote the sub-matrix of A where the columns corresponding to Ω^∁ are deleted. By ∆(A) we denote the set of all Borel probability measures over the set A.

2. PROBLEM SETTING

We consider N users indexed by [N], each associated with a horizon-H MDP with the same state space S, the same action space A and the same transition kernels P = (P_1, ..., P_{H−1}), where P_h(·|s_h, a_h) is a probability measure over S giving the distribution of the state at time h + 1 when action a_h is taken in state s_h at time h. Each user u has a different reward R_u = (R_{1u}, ..., R_{Hu}), where R_{hu} : S × A → [0, 1]. Denote the MDP associated with user u by M_u := (S, A, P, R_u). For the sake of simplicity, we will assume that the rewards are deterministic and that all the MDPs start at a random state S_1 with the same distribution. A policy Π := (π_1, ..., π_H) is a collection of kernels, where π_h(·|s) ∈ ∆(A) gives the probability distribution over actions in state s at time h. By (S_{1:H}, A_{1:H}) we denote the trajectory (S_1, A_1), (S_2, A_2), ..., (S_H, A_H) ∈ S × A, and by (S_{1:H}, A_{1:H}) ∼ M(Π) we mean the random trajectory of the MDP under the policy Π, where A_h ∼ π_h(·|S_h) and S_{h+1} ∼ P_h(·|S_h, A_h). Define the value function of M_u under policy Π as V(Π, M_u) := E_{(S_{1:H}, A_{1:H}) ∼ M(Π)} ∑_{h=1}^H R_{hu}(S_h, A_h). We call a policy Π̂_u ϵ-optimal for M_u if V(Π̂_u, M_u) ≥ sup_Π V(Π, M_u) − ϵ. Our goal is to find ϵ-optimal policies for every u ∈ [N] with as few samples as possible under low-rank assumptions on the rewards R_{hu}. Reward Free Exploration: The objective of reward free RL is to explore an MDP such that we can obtain the optimal policy for every possible reward. After collecting K trajectories from the MDP sequentially (denoted by D_K), the algorithm outputs functions Π̂ and V̂ whose input is a reward function R = (R_h(·,·))_{h=1}^H bounded in [0, 1], and whose output is a nearly-optimal policy Π̂(R) and its estimated value V̂(Π̂(R)) for this reward function.
Denote the MDP with this reward function by M_R. Given ϵ > 0 and δ ∈ [0, 1], we let K_rf(ϵ, δ) be such that whenever K ≥ K_rf(ϵ, δ), with probability at least 1 − δ we have: a) sup_R |V(Π̂(R), M_R) − V̂(Π̂(R))| ≤ ϵ and b) Π̂(R) is an ϵ-optimal policy for M_R for every R. This setting was introduced in Jin et al. (2020a). In this work, we will use the reward free exploration algorithms in Zhang et al. (2020) for tabular MDPs and Wagenmaker et al. (2022) for linear MDPs. Tabular MDP Setting: S and A are finite sets. Collect the rewards R_{hu}(s, a) into the N × |S||A| matrix R_h with R_h(u, (s, a)) = R_{hu}(s, a). We have the following low-rank assumption: Assumption (Tab) 1. The matrix R_h has rank r for some r ≤ (1/2) min(N, |S||A|). Linear MDP Setting: Our definition is slightly different from the one in Jin et al. (2020b). In this setting, we consider embeddings ϕ : S × A → R^d and ψ : S × A → R^d such that ∥ϕ(s, a)∥_1 ≤ 1 and ∥ψ(s, a)∥_2 ≤ 1. We make the following assumptions: 1. There exists θ_{hu} ∈ R^d with ∥θ_{hu}∥_2 ≤ √d such that R_{hu}(s, a) = ⟨θ_{hu}, ψ(s, a)⟩ and R_{hu}(s, a) ∈ [0, 1]. 2. There exist signed measures µ_{1h}, ..., µ_{dh} over the space S such that P_h(·|s, a) = ∑_{i=1}^d µ_{ih}(·)⟨ϕ(s, a), e_i⟩. We will assume that the µ_{ih} are such that ∥∫ µ_{ih}(ds) ϕ(s, a) π(da|s)∥_1 ≤ 1 and sup_{i,h} ∫ |µ_{ih}|(ds) ≤ C_µ; this holds in particular whenever the µ_{ih} are probability measures. We consider different embeddings for the transition (ϕ) and the reward (ψ) because the transition embeddings have a natural ∥·∥_1 structure, since they give the linear combinations of measures which make up P_h(·|s, a). We denote by Θ_h the N × d matrix whose u-th row is θ_{hu}^⊤. The low-rank assumption in this setting takes the following form: Assumption (Lin) 1. The N × d matrix Θ_h has rank r ≤ (1/2) min(N, d). For the task of reward maximization in linear MDPs, deterministic policies of the form π_h(s) = arg sup_a ⟨ψ(s, a), u_h*⟩ + ⟨ϕ(s, a), v_h*⟩ suffice, for u_h*, v_h* ∈ R^d.
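For finite S and A, the deterministic policies above reduce to an argmax over per-state scores; a small sketch with toy embeddings (all names and dimensions here are our own):

```python
import numpy as np

def greedy_policy(psi, phi, u, v):
    """pi_h(s) = argmax_a <psi(s,a), u> + <phi(s,a), v>.
    psi, phi have shape (num_states, num_actions, d)."""
    scores = psi @ u + phi @ v      # shape (num_states, num_actions)
    return scores.argmax(axis=1)    # one greedy action per state

rng = np.random.default_rng(0)
S, A, d = 4, 3, 5
psi = rng.standard_normal((S, A, d))
phi = rng.standard_normal((S, A, d))
u, v = rng.standard_normal(d), rng.standard_normal(d)
actions = greedy_policy(psi, phi, u, v)   # one action index per state
```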
We want to complete the matrix Θ_h with data of the form (e_i, ψ(S_h, A_h), e_i^⊤ Θ_h ψ(S_h, A_h)). To achieve matrix estimation, we need to query (S_h, A_h) such that the distribution of ψ(S_h, A_h) is 'nearly isotropic' (see Section 6). Such (S_h, A_h) cannot necessarily be obtained as the result of a reward maximization policy for a single agent; the task is related to the mean-field limit of multi-agent RL (similar to the setting in Cammardella et al. (2020)), and the required policy may necessarily be randomized. However, the space of all possible policies can be very large and intractable. Therefore, we restrict our attention to policies given by some fixed policy space Q. With some abuse of notation, we define the total variation distance between two kernels as TV(π_h, π_h′) := sup_{s∈S} TV(π_h(·|s), π_h′(·|s)), and we define a distance over Q by D_Q(Π^{(1)}, Π^{(2)}) = sup_{h∈[H]} TV(π_h^{(1)}, π_h^{(2)}), where Π^{(i)} = (π_h^{(i)} : h ∈ [H]). We refer to Section A for additional discussion on the above observations, connections to multi-agent mean-field RL, and a construction of Q such that it contains ϵ-optimal policies for every possible linear reward.
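For finite state and action spaces, the distance D_Q is straightforward to evaluate directly; a small sketch (the (H, |S|, |A|) array layout is our convention):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * np.abs(p - q).sum()

def d_Q(Pi1, Pi2):
    """D_Q(Pi1, Pi2) = sup_h sup_s TV(pi_h^(1)(.|s), pi_h^(2)(.|s)) for
    policies given as arrays of shape (H, num_states, num_actions)."""
    H, S, _ = Pi1.shape
    return max(tv(Pi1[h, s], Pi2[h, s]) for h in range(H) for s in range(S))

Pi1 = np.array([[[1.0, 0.0], [0.5, 0.5]]])   # H = 1, two states, two actions
Pi2 = np.array([[[0.5, 0.5], [0.5, 0.5]]])
dist = d_Q(Pi1, Pi2)                          # sup attained at the first state
```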

3. THE ALGORITHM

Our algorithm proceeds in four phases. Since all the users in our setting have the same transition probabilities, we can run reward free RL (Phase 1) in a distributed fashion over the users: reward free exploration is useful here because all users share the same MDP and the main unknown is the reward matrix, and it can be done collectively by selecting random users instead of the same user, which reduces the per-user sample complexity. Its output is then used in two ways: 1) collaboratively exploring the space in order to complete the reward matrix (Phase 2), and 2) learning the optimal policy for every user once the reward matrix is known (Phase 4).

3.1. TABULAR MDP CASE:

Phase 1: Reward Free Exploration. We run the reward free RL algorithm in Zhang et al. (2020) for K_rf(ϵ/8, δ/2) = C |S||A| H² (|S| + log(1/δ)) ϵ^{−2} polylog(|S||A|H/ϵ) time steps, picking the MDP corresponding to a uniformly random user whenever the reward free RL algorithm queries a trajectory. Let the output of the reward free RL algorithm be Π̂ and V̂. Phase 2: Querying the Reward Matrix. In this phase we query a 'uniform mask' with parameter p for the reward matrix R_h using Algorithm 1. For each (s, a) ∈ S × A and h ∈ [H], maintain a counter T_{h,(s,a)}, initialized at 0. Given the 'active sets' G = (G_h)_{h∈[H]} with G_h ⊆ S × A, we define the reward J(·; G) = (J_1, ..., J_H) by

J_h(s, a; G) = 1((s, a) ∈ G_h).   (1)

Initialize the active sets with G_h = S × A, and initialize the reward matrix R̂_h(u, (s, a)) = *, where * denotes an unknown entry. Algorithm 1 terminates when it detects that a sufficient number of samples has been collected for matrix completion. Phase 3: Reward Matrix Completion. We receive G_h and the partially observed matrix R̂_h for each h as the output of Algorithm 1. By R̂_h^{G_h^∁} we denote the sub-matrix of R̂_h where the columns corresponding to G_h are deleted. We use the nuclear norm minimization algorithm given in Recht (2011) to complete the sub-matrix R̂_h^{G_h^∁}, and we set the remaining columns (those in G_h) to 0. Phase 4: Computing the Optimal Policy. We compute the optimal policy for each user using the rewards from the completed matrices R̂_h via the output Π̂ of the reward free RL from Phase 1.
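The sampling loop of Phase 2 can be sketched as follows; we simplify by drawing trajectories from a generic `sample_traj` stand-in (instead of rolling out the exploration policy) and by terminating when every column has left the active set, rather than via the exploration value P_G as in Algorithm 1:

```python
import random

def uniform_mask_sample(N, SA, H, p, sample_traj, reward, max_iters=100000):
    """Sketch of Algorithm 1's loop: fill a uniform mask of each reward
    matrix R_h by querying trajectories of uniformly random users.
    sample_traj() returns H state-action indices; reward(u, h, sa) = R_h(u, sa)."""
    R = [{} for _ in range(H)]                        # partially observed R_h
    count = [dict.fromkeys(range(SA), 0) for _ in range(H)]
    active = [set(range(SA)) for _ in range(H)]       # active sets G_h
    target = max(1, int(N * p))                       # N*p samples per column
    for _ in range(max_iters):
        if all(not g for g in active):
            break
        u = random.randrange(N)                       # U_t ~ Unif([N])
        for h, sa in enumerate(sample_traj()):
            if sa in active[h] and (u, sa) not in R[h]:
                R[h][(u, sa)] = reward(u, h, sa)      # fill missing entry
                count[h][sa] += 1
                if count[h][sa] == target:            # column fully sampled
                    active[h].discard(sa)             # remove from active set
    return R, active

random.seed(0)
N, SA, H = 10, 5, 2
R, active = uniform_mask_sample(
    N, SA, H, p=0.3,
    sample_traj=lambda: [random.randrange(SA) for _ in range(H)],
    reward=lambda u, h, sa: (u * h + sa) % 2)         # toy deterministic reward
```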

3.2. LINEAR MDP CASE:

Phase 1: Reward Free RL. We run the reward free RL algorithm for linear MDPs from Wagenmaker et al. (2022), with error ϵ and probability of failure δ/4. We use trajectories from random users whenever a trajectory is queried. Here, K_rf(ϵ, δ/4) = C d H^5 (d + log(1/δ)) ϵ^{−2} + C d^{9/2} H^6 log^4(1/δ) ϵ^{−1}.

Algorithm 1: Uniform Mask Sampler for Tabular MDPs
  Output: active sets G = (G_h)_{h∈[H]}; partially observed matrices R̂_h
  t ← 0; P_G ← V̂(J(·; G)); Π̂_G ← Π̂(J(·; G))
  while P_G > ϵ/2 do
    U_t ∼ Unif([N])  // pick a user uniformly at random
    (S^{(t)}_{1:H}, A^{(t)}_{1:H}, R^{(t)}_{1:H}) ∼ M_{U_t}(Π̂_G)  // query trajectory
    for h ∈ [H] do
      if (S^{(t)}_h, A^{(t)}_h) ∈ G_h and R̂_h(U_t, (S^{(t)}_h, A^{(t)}_h)) = * then
        T_{h,(S^{(t)}_h, A^{(t)}_h)} ← T_{h,(S^{(t)}_h, A^{(t)}_h)} + 1  // increment count
        R̂_h(U_t, (S^{(t)}_h, A^{(t)}_h)) ← R^{(t)}_h  // fill missing entry
      end
      if T_{h,(S^{(t)}_h, A^{(t)}_h)} = Np then
        G_h ← G_h \ {(S^{(t)}_h, A^{(t)}_h)}  // remove element from active set
      end
    end
    t ← t + 1; P_G ← V̂(J(·; G)); Π̂_G ← Π̂(J(·; G))
  end

Phase 2: Querying Linear Measurements of the Reward Matrix. We obtain policies whose trajectory data allows low-rank matrix estimation of the reward matrix. Step 1: For each time step h ∈ [H], we want to query samples (S^{(t)}_h, A^{(t)}_h) such that ∑_{t=1}^T ϕ(S^{(t)}_h, A^{(t)}_h) ϕ^⊤(S^{(t)}_h, A^{(t)}_h) ⪰ κ² I. This can be done by Algorithm 2. Given a projector Q onto some subspace of R^d, by Q_{h,Q} we denote the reward which equals ∥Qϕ(s, a)∥² at time h and 0 at all other times. The termination condition ensures that we see enough data in all directions of ϕ, which allows us to find the collaborative exploration policy below. Step 2: Using the observations from Step 1, we compute the policy Π̂_{f,h} which approximately satisfies the property given in Assumption (Lin) 3. This procedure is described in Section 5.
Phase 3: Estimating the Low Rank Reward Matrix. For this, we use the active learning procedure given in Section 6 via row-wise linear measurements, along with the policy Π_{MC,h} = Π̂_{f,h} computed in Phase 2. Phase 4: Computing the Optimal Policy. Once the reward matrices Θ_h have been reconstructed for every h in Phase 3, we use the output of reward free RL to compute an ϵ-optimal policy for each user.
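The Grammian test at the heart of Algorithm 2, detecting the eigenspace where ∑_t ϕϕ^⊤ still falls below κ²I, can be sketched as follows; random unit vectors stand in for the features ϕ(S_h, A_h) of real trajectories:

```python
import numpy as np

def deficient_subspace(G, kappa):
    """Projector onto the eigenspace of the Grammian G with eigenvalues
    below kappa**2 (directions still lacking data); None if G >= kappa^2 I."""
    vals, vecs = np.linalg.eigh(G)
    low = vecs[:, vals < kappa**2]
    return None if low.shape[1] == 0 else low @ low.T

rng = np.random.default_rng(0)
d, kappa = 3, 0.5
G = np.zeros((d, d))
while deficient_subspace(G, kappa) is not None:   # loop until G >= kappa^2 I
    phi = rng.standard_normal(d)
    phi /= np.linalg.norm(phi)                    # stand-in for phi(S_h, A_h)
    G += np.outer(phi, phi)                       # update Grammian
```

In the actual algorithm the projector would be fed back into the reward Q_{h,Q} to steer exploration toward the deficient directions; here the draws are exogenous.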

4.1. TABULAR MDP:

The standard assumption for low-rank matrix completion is incoherence, which ensures that the matrix is not too concentrated on a few entries, so that sparse measurements are sufficient to learn it. The following definition is used in Recht (2011). Definition 1. Given an r-dimensional subspace U of R^n, we define the coherence of U as µ(U) := (n/r) sup_{1≤i≤n} ∥P_U e_i∥². An n_1 × n_2 matrix M with singular value decomposition UΣV^⊤ is called (µ_0, µ_1)-coherent if: a) the coherence of the row and column spaces of M is at most µ_0; b) the absolute value of every entry of UV^⊤ is bounded above by µ_1 √(r/(n_1 n_2)).

Algorithm 2: Well Conditioned Matrix Sampler
  Input: total time T; tolerance γ; lower isometry constant κ
  Output: ϕ_{ht}, S_{(h+1)t} for h ∈ [H−1], t ∈ [T]
  for h = 1, ..., H−1 do
    Q ← I; Π̂_Q ← Π̂(Q_{h,Q}); G_{ϕ,h} ← 0  // Grammian initialized to 0
    for t = 1, ..., T do
      U_t ∼ Unif([N])  // pick a uniformly random user
      (S_{1:H}, A_{1:H}) ∼ M_{U_t}(Π̂_Q)  // obtain trajectory
      ϕ_{ht} ← ϕ(S_h, A_h); S_{(h+1)t} ← S_{h+1}
      G_{ϕ,h} ← G_{ϕ,h} + ϕ_{ht} ϕ_{ht}^⊤  // update Grammian
      if ∃ y ∈ S^{d−1} : y^⊤ G_{ϕ,h} y < κ² then
        Q ← projector onto the eigenspace of G_{ϕ,h} with eigenvalues < κ²
        Π̂_Q ← Π̂(Q_{h,Q})
      end
    end
  end

Given a policy Π and Ω ⊆ S × A, by P_h^Π(Ω) we denote the probability that (S_h, A_h) ∈ Ω at time h under the policy Π. Assumption (Tab) 2. Given the reward matrix R_h and Ω_h ⊆ S × A, recall the notation R_h^{Ω_h} for the sub-matrix of R_h. If sup_Π P_h^Π(Ω_h^∁) < ϵ, then: 1. R_h^{Ω_h} is (µ_0, µ_1)-coherent; 2. |Ω_h^∁| ≤ |S|/2. The coherence assumption on R_h^{Ω_h} makes sense since the set Ω_h^∁ cannot be reached by any policy with probability larger than ϵ; in fact, we can arrive at an ϵ-optimal policy for the original reward by simply setting the rewards on Ω_h^∁ to 0. These can be thought of as redundant state-action pairs which do not matter for our RL model under any reward. Theorem 1. Suppose Assumptions (Tab) 1 and 2 hold.
Let the parameter p = C max(µ_1², µ_0) r (N + |S||A|) log²(|S||A|) log(H/δ) / (N|S||A|) for some large enough constant C, and assume that |S||A| and N are large enough that p < 1/2. Then, with probability at least 1 − δ, we can find an ϵ-optimal policy Π̂_u for every user u ∈ [N] whenever the total number of trajectories queried is:

C |S||A| H² (|S| + log(1/δ)) ϵ^{−2} polylog(|S||A|H/ϵ) + C max(µ_1², µ_0) r (N + |S||A|) H log²(|S||A|) log(H/δ) ϵ^{−1}.

Remark 1. For large N, the number of trajectories queried per user is Õ(rH log²(|S||A|)/ϵ), which is an exponential improvement in the state-space size dependence when compared to the minimax rate of |S||A|H²/ϵ² (Dann & Brunskill, 2015). Every phase of the algorithm has computational complexity polynomial in N, |S||A| and 1/ϵ. The probability p is chosen such that p|S||A|N = Õ(r(|S||A| + N)), which is the number of free parameters required to describe a rank r matrix.
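The coherence µ(U) of Definition 1 is simple to compute given an orthonormal basis U of the subspace: ∥P_U e_i∥² is just the squared norm of the i-th row of U (the function name below is our own):

```python
import numpy as np

def coherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2 for an orthonormal basis U (n x r)."""
    n, r = U.shape
    return (n / r) * (np.linalg.norm(U, axis=1) ** 2).max()

n = 8
spiky = np.eye(n)[:, :1]               # subspace spanned by e_1
spread = np.full((n, 1), n ** -0.5)    # subspace spanned by the all-ones vector
mu_spiky = coherence(spiky)            # = n, maximally coherent
mu_spread = coherence(spread)          # = 1, the minimum possible value
```

Matrices whose singular subspaces look like `spiky` cannot be completed from uniformly random entries, which is what Assumption (Tab) 2 rules out for the reachable columns.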

4.2. LINEAR MDP

Assumption (Lin) 2. There exists γ > 0 such that for every x ∈ S^{d−1} and every h ∈ [H], there exists a policy Π_{x,h} such that whenever S_{1:H}, A_{1:H} ∼ M(Π_{x,h}), we have E⟨ϕ(S_h, A_h), x⟩² ≥ γ. This assumption says that we can obtain information about all directions. If it fails for every γ > 0 in some direction x_0, then ϕ(S_h, A_h) has no component along x_0 under any policy; we can then remove the subspace spanned by x_0 and take the embedding space at time h to be R^{d−1}. Assumption (Lin) 3. There exist ζ, ξ > 0 such that for every h ∈ [H], there exists a policy Π_{h,ζ,ξ} ∈ Q such that whenever S_{1:H}, A_{1:H} ∼ M(Π_{h,ζ,ξ}), we have:

inf_{x∈S^{d−1}} E[ |⟨x, ψ(S_h, A_h)⟩| √d − ξ d ⟨x, ψ(S_h, A_h)⟩² ] ≥ ζ.

This assumption ensures that there exist measurements ψ(S_h, A_h) which are conducive to low-rank matrix estimation as considered in Section 6. Assumption (Lin) 4. For any 1 ≥ η > 0, there exists an η-net for Q, denoted by Q̂_η, such that log|Q̂_η| ≤ D log(1/η). We refer to Section A, where we justify this assumption: we first demonstrate that deterministic policies, which are sufficient for reward maximization (as used in Jin et al. (2020b)), cannot be used in this context, so a set of stochastic policies is required, and we then construct such policy classes with D = O(dH log(dH) log log(|A|)). Theorem 2. Suppose Assumptions (Lin) 1, 2, 3 and 4 hold and suppose ϵ < γ/2. In Algorithm 2, we set

κ = C C_µ dH √(dH + D) (√d + ξd) ζ^{−1} log(C_µ H(d + D)/(ζγδ)) and T = C κ² d γ^{−2} log(dκ/γ).

Then, with probability at least 1 − δ, our algorithm finds an ϵ-optimal policy for every user u ∈ [N] with the total number of trajectories bounded by T_rf + T_pol + T_mat-comp, where:

T_rf = dH^5 (d + log(1/δ)) ϵ^{−2} + d^{9/2} H^6 log^4(1/δ) ϵ^{−1},
T_pol = C_µ² d^5 H³ (dH + D) log²(C_µ H(d + D)/(ζγδ)) ζ^{−2} γ^{−2},
T_mat-comp = C [ Hr(N + d log N) log(d/(ζξ)) + H log N log(log N/δ) ] ζ^{−2} ξ^{−2}.

Remark 2. Note that whenever N is large, the sample complexity per user is Õ(Hr).
This is much better than the Ω(d²H²) dependence in the minimax lower bounds for linear MDPs (Wagenmaker et al., 2022). While Phases 1 and 2 of the algorithm have computational complexity polynomial in d and 1/ϵ, the optimization problems posed in Phases 3 and 4 are not necessarily solvable in polynomial time; we leave the computational aspects to future work. Notice that the sample complexity Õ(r(N + d)) corresponds to the number of free parameters required to describe a rank r matrix.

5. OBTAINING POLICIES WITH GIVEN STATISTICS

In this section, we consider the linear MDP setting and describe the sub-routine of Step 2 of Phase 2 of the algorithm, where we compute a policy Π̂_{f,h} such that the law of ψ(S_h, A_h) under this policy approximately satisfies the property in Assumption (Lin) 3. This is required in order to use the guarantees for low-rank matrix estimation in Phase 3, which is described in Section 6. We first state a structural lemma which characterizes the law of (S_{h+1}, A_{h+1}) under any policy Π. Lemma 1. Consider any policy Π = (π_1, ..., π_H) for the MDP M and let S_{1:H}, A_{1:H} ∼ M(Π). Then for any bounded, measurable function g : S × A → R, we have:

E g(S_{h+1}, A_{h+1}) = ∑_{i=1}^d ν_i ∫ g(s, a) µ_{ih}(ds) π_{h+1}(da|s), where ν_i := ⟨E ϕ(S_h, A_h), e_i⟩.

We now want to estimate certain statistics under any policy using the data obtained from the output of Algorithm 2. Notice that Algorithm 2 outputs a sequence of random variables (ϕ_{h1}, S_{(h+1)1}), ..., (ϕ_{hT}, S_{(h+1)T}) ∈ R^d × S such that (S_{(h+1)l})_{l=1}^T | (ϕ_{hl})_{l=1}^T ∼ ⊗_{l=1}^T ∑_{i=1}^d ⟨ϕ_{hl}, e_i⟩ µ_{hi}(·), and G_{ϕ,h} := ∑_{t=1}^T ϕ_{ht} ϕ_{ht}^⊤. For any measurable function g : S × A → R^K, any ν ∈ R^d with ∥ν∥ ≤ 1, and any randomized policy Π = (π_1, ..., π_H), we define:

1. T_1(g, π_1) = E ∫ g(S_1, a) π_1(da|S_1)
2. T_{h+1}(g; ν, π_{h+1}) = ∑_{i=1}^d ⟨ν, e_i⟩ ∫ µ_{ih}(ds) π_{h+1}(da|s) g(s, a) when h ≤ H − 1
3. E_1^ν(Π) := ∥T_1(ϕ, π_1) − ν∥_1
4. E_h^ν(Π) := inf_{ν_1,...,ν_{h−1} ∈ B_d(1)} F(Π, ν_1, ..., ν_{h−1}, ν) whenever h > 1

Define α_{ht,ν} = ϕ_{ht}^⊤ G_{ϕ,h}^{−1} ν. We estimate these operators from data as follows:

1. T̂_1(g, π_1) = (1/T) ∑_{t=1}^T ∫ g(S_{1t}, a) π_1(da|S_{1t})
2. T̂_{h+1}(g; ν, π_{h+1}) = ∑_{t=1}^T α_{ht,ν} ∫ g(S_{(h+1)t}, a) π_{h+1}(da|S_{(h+1)t})
3. Ê_1^ν(Π) := ∥T̂_1(ϕ, π_1) − ν∥_1
4. Ê_h^ν(Π) := inf_{ν_1,...,ν_{h−1} ∈ B_d(1)} F̂(Π, ν_1, ..., ν_{h−1}, ν) whenever h > 1

where, for h > 1 and ν_1, ..., ν_{h−1} ∈ B_d(1), we have defined:

1. F(Π, ν_1, ..., ν_{h−1}, ν_h) := E_1^{ν_1}(Π) + ∑_{j=2}^h ∥T_j(ϕ; ν_{j−1}, π_j) − ν_j∥_1
2. F̂(Π, ν_1, ..., ν_{h−1}, ν_h) := Ê_1^{ν_1}(Π) + ∑_{j=2}^h ∥T̂_j(ϕ; ν_{j−1}, π_j) − ν_j∥_1

Define f(s, a; x) := |⟨x, ψ(s, a)⟩| √d − ξ d ⟨x, ψ(s, a)⟩². The output of our method is:

1. Π̂_{f,1} = arg sup_{Π=(π_1,...,π_H)∈Q} inf_{x∈S^{d−1}} T̂_1(f(·; x), π_1)
2. (Π̂_{f,h}, ν̂) = arg sup_{ν∈B_d(1), Π=(π_1,...,π_H)∈Q} inf_{x∈S^{d−1}} T̂_h(f(·; x); ν, π_h) subject to Ê_{h−1}^ν(Π) ≤ η_0, whenever h > 1
3. Assign the output Π̂_{ζ,ξ,h} = Π̂_{f,h}.

The idea behind this method is as follows. First, using the output of Algorithm 2, we construct T̂_h(g; ν, π_h), which approximates the functional T_h(g; ν, π_h) uniformly over ν and π_h; this is shown in Lemma 9 in the appendix. We show in Theorem 3 that obtaining policies which can be used with the matrix completion routine reduces to picking a policy Π_{f,h} = (π_1, ..., π_H) such that whenever S_{1:H}, A_{1:H} ∼ M(Π_{f,h}), we have E inf_{x∈S^{d−1}} f(S_h, A_h; x) ≥ ζ. By Lemma 1, if such a policy exists then there exist ν_1, ..., ν_{h−1} such that E ϕ(S_j, A_j) = ν_j for j < h and inf_{x∈S^{d−1}} T_h(f(·; x); ν_{h−1}, π_h) ≥ ζ. Since we only have sample access, we find such a policy approximately by optimizing the estimates T̂ instead of the exact functionals T, as described above. Theorem 3. Condition on the event G_{ϕ,h} ⪰ κ² I for every h ∈ [H]. Let κ, η, η_0 be such that, for small enough constants c_0, c > 0 and a large enough constant C > 0:

1. η ≤ c ζ / (C_µ d H² (√d + ξd) κ² T); η_0 = c_0 ζ / (C_µ (√d + dξ))
2. κ ≥ C (C_µ (√d + ξd) dH / ζ) ( log(dH|Q̂_η|/δ) + dH log(d/η) )

Suppose Assumption (Lin) 3 holds. Then, with probability at least 1 − δ/4, the policy Π̂_{f,h} is such that whenever S_{1:H}, A_{1:H} ∼ M(Π̂_{f,h}), we have E inf_{x∈S^{d−1}} f(S_h, A_h; x) ≥ ζ/2. This implies that ψ(S_h, A_h) satisfies E|⟨ψ(S_h, A_h), x⟩| ≥ ζ/(2√d) for every x ∈ S^{d−1}, and E ψ(S_h, A_h) ψ(S_h, A_h)^⊤ ⪯ (dξ²)^{−1} I.
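The reweighting identity behind the empirical operators T̂_{h+1} can be checked directly: with α_t = ϕ_t^⊤ G^{−1}ν and G = ∑_t ϕ_t ϕ_t^⊤, we have ∑_t α_t ϕ_t = ν exactly, so any conditional mean that is linear in ϕ_t is reweighted onto the target ν (the variable names below are ours):

```python
import numpy as np

def alpha_weights(Phi, nu):
    """alpha_t = phi_t^T G^{-1} nu with Grammian G = sum_t phi_t phi_t^T."""
    G = Phi.T @ Phi
    return Phi @ np.linalg.solve(G, nu)

rng = np.random.default_rng(0)
T, d = 50, 4
Phi = rng.standard_normal((T, d))          # observed features phi_ht
nu = rng.standard_normal(d)
nu /= np.linalg.norm(nu)                   # target mean, ||nu|| <= 1
m = rng.standard_normal(d)                 # stand-in for m_i = int g dmu_ih
alpha = alpha_weights(Phi, nu)
lhs = alpha @ (Phi @ m)                    # sum_t alpha_t <phi_t, m>
rhs = nu @ m                               # <nu, m>, the exact functional
```

The statistical error of T̂ then comes only from replacing the conditional means by the sampled next states, which is why the lower bound G_{ϕ,h} ⪰ κ²I on the Grammian controls the size of the weights.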

6. MATRIX ESTIMATION WITH ROW-WISE LINEAR MEASUREMENTS

In this section, we describe the active-learning-based low-rank matrix estimation procedure. For an unknown rank r matrix Θ* (corresponding to the reward matrix Θ_h* in the definition of linear MDPs) of dimensions N × d, we are allowed to query samples of the form (e_i, ψ, e_i^⊤ Θ* ψ) for any i ∈ [N] of our choice, where ψ = ψ(S_h, A_h) and S_{1:H}, A_{1:H} ∼ M(Π_{MC,h}) for some input policy Π_{MC,h}. This corresponds to running the MDP of user i with the policy Π_{MC,h} and observing the reward at time h, given by ⟨e_i, Θ_h* ψ(S_h, A_h)⟩. Our basic task is to estimate the matrix Θ* from these samples with high probability.
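As a simplified, noiseless illustration of how row-wise measurements combine with low rank (a two-stage construction of our own, not the exact routine of Section 6.1): once r rows spanning the row space of Θ* are known, every remaining row is pinned down by just r generic measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 20, 8, 2
Theta = rng.standard_normal((N, r)) @ rng.standard_normal((r, d))  # rank-r truth

# Stage 1: measure r rows completely; generically they span the row space.
V = Theta[:r]
# Stage 2: each remaining row costs only r measurements e_i^T Theta psi.
Theta_hat = np.vstack([V, np.zeros((N - r, d))])
for i in range(r, N):
    Psi = rng.standard_normal((r, d))      # r random query directions psi_ik
    y = Psi @ Theta[i]                     # observed rewards <Theta_i, psi_ik>
    c = np.linalg.solve(Psi @ V.T, y)      # coefficients of Theta_i over V's rows
    Theta_hat[i] = c @ V
```

In total this uses rd + (N − r)r measurements, matching the Õ(r(N + d)) scaling of Theorem 4 up to logarithms; the actual algorithm must additionally discover which rows are consistent with the current fit and cope with the non-i.i.d. trajectory-induced distribution of ψ.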

6.1. THE ESTIMATOR

Given any N × d matrix ∆, by ∆_i^⊤ we denote its i-th row. Given K ∈ N and a sequence of vectors Ψ = (ψ_{ik} ∈ R^d)_{i∈[N],k∈[K]}, define:

L(∆, Ψ) := (1/(NK)) ∑_{i=1}^N ∑_{k=1}^K |⟨∆_i, ψ_{ik}⟩|².

We estimate Θ* row-wise using the following iterative procedure, where we recover some rows of Θ* into Θ̂ in each iteration from the corresponding linear measurements of Θ*. Let Ī_{t−1} be the set of unknown rows at iteration t (with Ī_0 = [N]). At iteration t, we draw a fresh sequence of vectors Ψ^{(t)} from some distribution, recover some rows Ī_t^∁ ⊆ Ī_{t−1} of Θ*, and store them in Θ̂:

1. Draw Ψ^{(t)} = (ψ_{ik}^{(t)})_{k∈[K_t], i∈Ī_{t−1}} and obtain θ_{ik}* = e_i^⊤ Θ* ψ_{ik}^{(t)}.
2. Consider the loss function L(Θ − Θ*, Ψ^{(t)}) := (1/(K_t |Ī_{t−1}|)) ∑_{i∈Ī_{t−1}} ∑_{k=1}^{K_t} |⟨Θ_i, ψ_{ik}^{(t)}⟩ − θ_{ik}*|².
3. Find a matrix Θ̂ with rank ≤ r such that L(Θ̂ − Θ*, Ψ^{(t)}) = 0.
4. Initialize Ī_t ← ∅.
5. For every i ∈ Ī_{t−1}, draw K fresh samples ψ̃_{i1}^{(t)}, ..., ψ̃_{iK}^{(t)} and compute ∑_{k=1}^K |⟨Θ̂_i, ψ̃_{ik}^{(t)}⟩ − θ̃_{ik}*|², where θ̃_{ik}* = e_i^⊤ Θ* ψ̃_{ik}^{(t)}. If this sum is > 0, then add i to Ī_t, i.e., Ī_t ← Ī_t ∪ {i}.
6. End the routine when Ī_t = ∅.

Suppose the ψ_{ik} are i.i.d. random vectors such that there exist ζ, ξ > 0 such that for any x ∈ R^d with ∥x∥ = 1 we have:

∥ψ_{ik}∥ ≤ 1 almost surely; E|⟨ψ_{ik}, x⟩| ≥ ζ/√d; E ψ_{ik} ψ_{ik}^⊤ ⪯ (dξ²)^{−1} I.   (2)

To give some intuition, the second condition above means that given any unit vector x, there is some overlap between the random vector ψ and x, ensuring that every measurement gives us some information which helps us complete the matrix. The third condition is a standard bound on the covariance matrix. We then have the following theorem, whose proof is presented in Section F. Theorem 4. Assume that sup_i ∥Θ_i*∥ ≤ C_θ and that the distribution of ψ_{ik}^{(t)} satisfies equation 2. Suppose K_t |Ī_{t−1}| = C (r|Ī_{t−1}| + dr) ζ^{−2} ξ^{−2} log(d/(ζξ)) + C ζ^{−2} ξ^{−2} log(log N/δ). With probability at least 1 − δ, the algorithm terminates after log N iterations and the output Θ̂ satisfies Θ̂ = Θ*. Therefore, with probability at least 1 − δ, the sample complexity for the estimation of Θ* is:

C r(N + d log N) ζ^{−2} ξ^{−2} log(d/(ζξ)) + C ζ^{−2} ξ^{−2} log N log(log N/δ).

A.1 ON THE NECESSITY OF RANDOMIZED POLICIES

We first show with a simple example that randomized policies might be necessary in such contexts, and that obtaining states which satisfy conditions like equation 2 goes beyond simple reward maximization. Suppose H = 1, S = {1} and A = {1, ..., d}, and consider the embedding ψ(s, a) = e_a. Suppose we want to obtain a policy π such that whenever S_1, A_1 ∼ π, λ_min(E ψ(S_1, A_1) ψ(S_1, A_1)^⊤) is maximized (where λ_min denotes the minimum eigenvalue). This is maximized when π(da|s) is the uniform distribution over A, and the corresponding value is 1/d; note that whenever π is a deterministic policy we have λ_min = 0 whenever d > 1. This is in contrast to reward maximization problems where, under general conditions, a deterministic optimal policy exists (see Theorem 1.7 in Agarwal et al. (2019)). In fact, we can also show that the policy which minimizes ∥E ψ(S_1, A_1) − (1/d) ∑_{a=1}^d e_a∥ must also necessarily be random. In the case of linear MDPs, we can find a deterministic optimal policy Π = (π_1, ..., π_H) of the form π_h(s) = arg sup_a ⟨ψ(s, a), u_h*⟩ + ⟨ϕ(s, a), v_h*⟩ (Jin et al., 2020b), which reduces the problem to estimating the parameters u_h*, v_h* even when the state-action space is an infinite set. However, when such policies are not guaranteed to exist, as in the case of the functional maximization required in Section 6, the set of all policies can be intractably large. This is the justification for restricting attention to a nice enough policy space Q.
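The λ_min example above can be verified numerically: with ψ(s, a) = e_a, we have E ψψ^⊤ = ∑_a π(a) e_a e_a^⊤ = diag(π), so the uniform policy attains 1/d while any deterministic policy gives 0.

```python
import numpy as np

def second_moment(pi):
    """E psi psi^T for a policy pi over d actions when psi(s, a) = e_a."""
    return np.diag(pi)      # sum_a pi(a) e_a e_a^T

d = 4
uniform = np.full(d, 1 / d)
deterministic = np.eye(d)[0]            # always play action 1
lam_unif = np.linalg.eigvalsh(second_moment(uniform)).min()        # = 1/d
lam_det = np.linalg.eigvalsh(second_moment(deterministic)).min()   # = 0 (up to float error)
```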

A.2 CONSTRUCTING POLICY SPACES

We consider any linear MDP satisfying the definition given in Section 2, and suppose $A$ is finite. We consider the set of all probability distributions $\pi_h(a|s; u, v) \propto \exp(\langle\phi(s,a), u\rangle + \langle\psi(s,a), v\rangle)$. We let $Q_h = \{\pi_h(\cdot|\cdot; u, v) : u, v \in B_d(R)\}$, and we let our policy space be $Q = \{\Pi = (\pi_1, \dots, \pi_H) : \pi_h \in Q_h\}$.

Lemma 2. Consider the probability distribution over a finite set $[|A|]$ given by $p_\beta(a) \propto \exp(\beta x_a)$ for every $a \in [|A|]$, some $x_a \in \mathbb{R}_+$ and $\beta \in \mathbb{R}_+$. For any $\epsilon > 0$ and random variable $A \sim p_\beta$, we must have:
$$\mathbb{P}\left(x_A < \sup_a x_a - \epsilon\right) \le |A|\exp(-\beta\epsilon) \quad\text{and}\quad \mathbb{E}\,x_A \ge \left(\sup_a x_a - \epsilon\right)\left(1 - |A|\exp(-\beta\epsilon)\right).$$

Lemma 3. Let $Q^*_h(s,a)$ be the optimal action-value function for the MDP. Then the policy $\Pi = (\pi_1, \dots, \pi_H)$ given by $\pi_h(a|s) \propto \exp(\beta Q^*_h(s,a))$ is $\epsilon H + H^2|A|\exp(-\beta\epsilon)$ sub-optimal for any $\epsilon > 0$.

Proof. Consider the optimal value function defined by $V^*_h(s) = \sup_a Q^*_h(s,a)$. Let $\bar{Q}_h(s,a)$ denote the action-value function under the policy $\Pi$ and let $\bar{V}_h(s) = \int \bar{Q}_h(s,a)\pi_h(da|s)$ denote the value at state $s$ under the policy $\Pi$. Clearly, we have $\bar{Q}_H(s,a) = Q^*_H(s,a) = R(s,a)$. Suppose $\bar{Q}_h(s,a) \ge Q^*_h(s,a) - \eta$ uniformly. Then we have
$$\bar{V}_h(s) \ge \int Q^*_h(s,a)\pi_h(da|s) - \eta \ge \left(\sup_a Q^*_h(s,a) - \epsilon\right)\left(1 - |A|\exp(-\beta\epsilon)\right) - \eta \ge \sup_a Q^*_h(s,a) - \epsilon - H|A|\exp(-\beta\epsilon) - \eta = V^*_h(s) - \epsilon - H|A|\exp(-\beta\epsilon) - \eta. \qquad (3)$$
In the second step we have invoked Lemma 2; in the last step we have used the fact that $Q^*_h \in [0, H]$ uniformly. Now, by the Bellman iteration,
$$\bar{Q}_{h-1}(s,a) = R_{h-1}(s,a) + \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a)}\bar{V}_h(s') \ge R_{h-1}(s,a) + \mathbb{E}_{s' \sim P_{h-1}(\cdot|s,a)}V^*_h(s') - \epsilon - H|A|\exp(-\beta\epsilon) - \eta = Q^*_{h-1}(s,a) - \epsilon - H|A|\exp(-\beta\epsilon) - \eta.$$
Therefore, by induction, we conclude that $\bar{V}_1(s) \ge V^*_1(s) - \epsilon H - H^2|A|\exp(-\beta\epsilon)$, and by the definition of the value function we conclude the claim.

Now, by a simple extension of Proposition 2.3 in Jin et al.
(2020b), we conclude that the optimal $Q^*_h$ function for any linear MDP can be written as $Q^*_h(s,a) = \langle\psi(s,a), u^*_h\rangle + \langle\phi(s,a), v^*_h\rangle$, where $\|u^*_h\|_2 \le \sqrt{d}$ and $\|v^*_h\|_\infty \le HC_\mu$. Observe that choosing $\epsilon = \frac{\eta}{2H}$ and $\beta = \frac{2\log(2H|A|/\eta)}{\eta}$ ensures that the randomized policy $\Pi$ in the statement of Lemma 3 is $\eta$-optimal. Therefore, we can take $R = \frac{2dHC_\mu\log(2H|A|/\eta)}{\eta}$ in the definition of $Q_h$ above and conclude that $Q$ contains an $\eta$-optimal policy for every MDP with embedding functions $\phi, \psi$. We will now bound the covering number. Recall the definition of the distance $D_Q(\Pi_1, \Pi_2) = \sup_{h\in[H]}\mathrm{TV}(\pi^{(1)}_h, \pi^{(2)}_h)$. It is therefore sufficient to obtain an $\eta$-cover for $Q_h$ (denoted $\bar{Q}_{h,\eta}$) and then construct $\bar{Q}_\eta = \{\Pi = (\pi_1, \dots, \pi_H) : \pi_h \in \bar{Q}_{h,\eta}\ \forall h \in [H]\} = \prod_{h=1}^H \bar{Q}_{h,\eta}$.

Lemma 4. Let $\pi(\cdot|s; u, v)$ be as defined at the beginning of this subsection. Then
$$\mathrm{TV}(\pi(\cdot|s; u, v), \pi(\cdot|s; u', v')) \le \frac{1}{2}\left(\exp(2\|u - u'\|_2 + 2\|v - v'\|_\infty) - 1\right).$$

Proof. Denote $\pi(a|s; u, v)$ by $\pi(a)$ and $\pi(a|s; u', v')$ by $\pi'(a)$. Consider the corresponding partition functions $Z := \sum_{a\in A}\exp(\langle\psi(s,a), u\rangle + \langle\phi(s,a), v\rangle)$ and $Z' := \sum_{a\in A}\exp(\langle\psi(s,a), u'\rangle + \langle\phi(s,a), v'\rangle)$. Using Hölder's inequality for $\langle u - u', \psi\rangle$ and $\langle v - v', \phi\rangle$, we conclude that
$$\exp(-\|u - u'\|_2 - \|v - v'\|_\infty) \le \frac{Z'}{Z} \le \exp(\|u - u'\|_2 + \|v - v'\|_\infty),$$
$$\exp(-2\|u - u'\|_2 - 2\|v - v'\|_\infty) \le \frac{\pi'(a)}{\pi(a)} \le \exp(2\|u - u'\|_2 + 2\|v - v'\|_\infty). \qquad (5)$$
Therefore,
$$\mathrm{TV}(\pi, \pi') = \frac{1}{2}\sum_{a\in A}|\pi(a) - \pi'(a)| = \frac{1}{2}\sum_{a\in A}\pi(a)\left|1 - \frac{\pi'(a)}{\pi(a)}\right| \le \frac{1}{2}\left(\exp(2\|u - u'\|_2 + 2\|v - v'\|_\infty) - 1\right).$$

Using the lemma above, we conclude that we may take $\bar{Q}_{h,\eta} = \{\pi_h(\cdot|\cdot; u, v) : u, v \in \bar{B}_{d,\eta/4}(R)\}$ whenever $\eta \le 1$. Here $\bar{B}_{d,\eta/4}(R)$ is an $\eta/4$-net over $B_d(R)$ with respect to the norm $\|\cdot\|_2$.
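The bound of Lemma 4 can be sanity-checked numerically. The sketch below (dimensions and features are arbitrary choices of ours) builds random features with $\|\psi\|_2 \le 1$ and $\|\phi\|_1 \le 1$, matching the Hölder pairings in the proof, and compares the exact total variation distance between two softmax policies against the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 3, 6

# random features with ||psi||_2 <= 1 and ||phi||_1 <= 1, matching the
# Hölder pairings ||u - u'||_2 and ||v - v'||_inf used in Lemma 4
psi = rng.normal(size=(n_actions, d))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)
phi = rng.normal(size=(n_actions, d))
phi /= np.abs(phi).sum(axis=1, keepdims=True)

def softmax_policy(u, v):
    logits = psi @ u + phi @ v
    w = np.exp(logits - logits.max())  # shift for numerical stability
    return w / w.sum()

u, v = rng.normal(size=d), rng.normal(size=d)
du, dv = 0.1 * rng.normal(size=d), 0.1 * rng.normal(size=d)

tv = 0.5 * np.abs(softmax_policy(u, v) - softmax_policy(u + du, v + dv)).sum()
bound = 0.5 * (np.exp(2 * np.linalg.norm(du) + 2 * np.abs(dv).max()) - 1)
print(tv, bound)
```

The inequality `tv <= bound` holds for any draw, since the proof only uses Hölder's inequality and the normalization of the features.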
From the results in Vershynin (2018), we can therefore take $|\bar{Q}_{h,\eta}| \le |\bar{B}_{d,\eta/4}(R)|^2 \le \exp(Cd\log(\frac{CR}{\eta}))$. Since $\bar{Q}_\eta = \prod_{h=1}^H \bar{Q}_{h,\eta}$, we conclude that
$$\log|\bar{Q}_\eta| \le cdH\left(\log\frac{dH}{\eta} + \log\log(2H|A|/\eta)\right).$$

A.3 RELATIONSHIP TO MEAN-FIELD LIMITS OF MULTI-AGENT RL

Consider the conditions given in equation 2 for $\psi(S_h, A_h)$ where $S_{1:H}, A_{1:H} \sim \Pi_{\text{mat-comp},h}$. We refer to the proof of Theorem 3 to show that the following condition implies the conditions given in equation 2:
$$\inf_{x\in S^{d-1}}\mathbb{E}\left[|\langle\psi(S_h, A_h), x\rangle|\sqrt{d} - \xi d\langle\psi(S_h, A_h), x\rangle^2\right] \ge \zeta.$$
Conversely, the conditions in equation 2 imply the following:
$$\inf_{x\in S^{d-1}}\mathbb{E}\left[|\langle\psi(S_h, A_h), x\rangle|\sqrt{d} - \frac{\xi^2}{\zeta^2}d\langle\psi(S_h, A_h), x\rangle^2\right] \ge \frac{\zeta}{2}.$$
Therefore, the problem of obtaining a policy satisfying equation 2 reduces to finding a policy such that the functional $\inf_x \mathbb{E}\left[|\langle\psi(S_h, A_h), x\rangle|\sqrt{d} - \beta d\langle\psi(S_h, A_h), x\rangle^2\right]$ is maximized for some small enough $\beta \in \mathbb{R}_+$. Note that this functional maps the distribution of $(S_h, A_h)$ (denoted $\Gamma_h(\Pi)$), obtained by applying some policy $\Pi$, to a real number. Let us denote this function by $J(\Gamma_h)$. Our objective now is to find a policy $\Pi_{\text{mat-comp},h}$ by solving the following optimization problem:
$$\arg\sup_{\Pi\in Q} J(\Gamma_h(\Pi)). \qquad (8)$$
This setup is similar to the mean-field multi-agent control problem presented in Cammardella et al. (2020). To see the connection to multi-agent systems explicitly, consider $n$ agents with the same MDP $M$ and embedding functions $\phi, \psi$. Each trajectory from this multi-agent system corresponds to running the MDP associated with each agent independently, with the same policy. The collective reward of the system is given by $J(\hat\Gamma_h)$, where $\hat\Gamma_h$ denotes the empirical distribution of state-actions of the $n$ agents at time $h$. Note that, on the one hand, picking a policy $\Pi_n$ to maximize this reward is a reward maximization problem on the joint multi-agent system.
On the other hand, for any fixed policy $\Pi$, as $n \to \infty$, $\hat\Gamma_h(\Pi) \to \Gamma_h(\Pi)$ under reasonable assumptions on the state space, via the law of large numbers, and hence $J(\hat\Gamma_h(\Pi)) \to J(\Gamma_h(\Pi))$ under continuity. Therefore the planning problem in equation 8 is the same as the multi-agent planning problem described above in the limit $n \to \infty$.

B ANALYSIS -TABULAR MDPS

We will call the reward-free RL procedure in Phase 1 successful if it outputs the $\epsilon$-optimal policy. This event has probability at least $1 - \frac{\delta}{2}$.

B.1 ANALYSIS OF ALGORITHM 1

Lemma 5. Suppose $p \le \frac{1}{2}$. Conditioned on the success of Phase 1, with probability at least $1 - \exp(-cNp|S||A|H)$, Algorithm 1 terminates after querying $\frac{C|S||A|NHp}{\epsilon}$ trajectories. Let $(G_h)_{h\in[H]}$ be the active sets at the termination of the algorithm. They satisfy
$$\sup_\pi \sum_{h=1}^H P^\pi_h(G_h) \le \frac{5\epsilon}{8}.$$

For any $a \times b$ matrix $R$, let $\tilde{R}$ be its partially observed version; that is, there exists a set of indices $I \subseteq [a]\times[b]$ such that $\tilde{R}_{ij} = R_{ij}$ if $(i,j) \in I$ and $\tilde{R}_{ij} = *$ otherwise. We say a random set of indices $J$ has the distribution $\mathrm{Unif}(m, [a], [b])$ if $J$ is drawn uniformly at random subject to $|J| = m$.

Lemma 6 (Modification: Mod1). Suppose we run, independently, a modification of Algorithm 1 where in the "Query trajectory" step the trajectories are sampled from a fixed MDP $M_1$ (but rewards come from the reward function corresponding to $U_t$). Consider all the random variables that determine the trajectory of this algorithm: $\hat{V}$, $\hat\Pi$, $(S_{1:H}, A_{1:H}, R_{1:H})_t$, $(U_t)_t$. Then the joint distribution of this collection of random variables is unchanged under the modification.

Proof. The proof follows from an induction argument on the time index $t$; we describe the key steps here. For ease of notation, let $T_t = ((S^{(t)}_{1:H}, A^{(t)}_{1:H}, R^{(t)}_{1:H}), U_t)$ and $X_T = (\hat{V}, \hat\Pi, (T_t)_{t\le T})$. Let $\tilde{X}_T$ and $\tilde{T}_t$ denote the corresponding quantities under the modification. It is enough to show that the finite-dimensional marginals have the same joint distribution under the modification. In particular, we will show:

1. $T_0 \stackrel{d}{=} \tilde{T}_0$.
2. Suppose $X_T \stackrel{d}{=} \tilde{X}_T$. Then the Markov kernel $k_{T_{T+1}|X_T}$ is almost surely (under the common distribution of $X_T, \tilde{X}_T$) equal to $k_{\tilde{T}_{T+1}|\tilde{X}_T}$. Thus $X_{T+1} \stackrel{d}{=} \tilde{X}_{T+1}$.

The first statement is straightforward: in the zeroth step, the distribution of $\hat\Pi_G$ is not affected by the modification, and thus, due to identical MDP transitions across users, the distribution of $T_0$ is preserved under the modification. A similar argument proves the second statement.
Roughly, given a realization of $X_T$, the distribution of $T_{T+1}$ is the same as the distribution of $\tilde{T}_{T+1}$ given the same realization of $\tilde{X}_T$, for exactly the reason presented for the first statement. A fully formal proof requires setting up appropriate probability spaces, so we omit it here; since the random variables considered are all discrete, one can also argue via PMFs.

Lemma 7. Suppose $p \le \frac{1}{2}$. Conditioned on the success of Phase 1 and the termination of Algorithm 1, for every $h \in [H]$, Algorithm 1 returns partially filled reward matrices $\tilde{R}_h$. Consider the sub-matrix $\tilde{R}^{G^\complement_h}_h$. Let $I_h \subseteq [N]\times G^\complement_h$ be the subset of observed indices of $\tilde{R}_h$, and let $J_h | G_h \sim \mathrm{Unif}(\frac{Np|G^\complement_h|}{2}, [N], G^\complement_h)$. There exists a coupling between $J_h$ and $I_h$ such that
$$\mathbb{P}\left(J_h \subseteq I_h \mid G_h\right) \ge 1 - |S||A|\exp(-cNp).$$

Proof. Let us fix $G_h$ and construct a coupling between $I_h$ and $J_h$. Consider any fixed, arbitrary permutations $\sigma_g$ over $[N]$, for $g \in G^\complement_h$. By $\sigma(I_h)$, we denote $\{(\sigma_g(i), g) : (i,g) \in I_h\}$. From Lemma 6 it is enough to prove the statement for the random variables under the modification described in that lemma (call this Mod1). Now consider a further modification (call it Mod2) where in every iteration $t$ we sample $U_t \sim \mathrm{Unif}([N])$ and, for each horizon $h$, set $\tilde{U}^{(t)}_h = \sigma_{(S^{(t)}_h, A^{(t)}_h)}(U_t)$, and then update the entry $\tilde{R}_h(\tilde{U}^{(t)}_h, (S^{(t)}_h, A^{(t)}_h))$ (instead of $\tilde{R}_h(U_t, (S^{(t)}_h, A^{(t)}_h))$). Next, we couple these two modifications by using the same $(\hat{V}, \hat\Pi)$ and the same set of $U_t$'s for both. Further, we couple the MDP used in these modifications to be the same, single MDP.
Now an induction argument shows that the sequences of active sets $G$ obtained in these modifications are identical for every time $t$; only the rows of $\tilde{R}_h$ in which entries are filled change, according to the set of permutations chosen. Thus, under the described coupling, Mod1 and Mod2 produce identical trajectories (i.e., $(S^{(t)}_{1:H}, A^{(t)}_{1:H})$), the columns of the reward matrices are permutations of each other as described by the chosen set of permutations, and Algorithm 1 terminates at the same time in both cases. However, the same induction argument also shows that for each $t$ and $h$, conditioned on $G_h$ and the trajectories (which are the same in Mod1 and Mod2) up to the beginning of iteration $t$, we have $(U_t, (S^{(t)}_h, A^{(t)}_h)) \stackrel{d}{=} (\tilde{U}^{(t)}_h, (S^{(t)}_h, A^{(t)}_h))$. Therefore, if $I_h, \tilde{I}_h \subseteq [N]\times G^\complement_h$ denote the subsets of observed indices at termination (outside the active set), then $\tilde{I}_h = \sigma(I_h) \equiv \{(\sigma_{(s,a)}(i), (s,a)) : (i, (s,a)) \in I_h\}$ and, conditioned on $G_h$, $\tilde{I}_h \stackrel{d}{=} I_h$.

Claim 2. At termination, conditioned on $G_h$, the random sets $I^g_h = \{(i,g) : (i,g) \in I_h\}$ are jointly independent.

Proof. Again we work with the modification Mod1 described in Lemma 6. For each $(s,a)$, consider the collection of $U_t$'s that are used to populate the column $(s,a)$ of the matrix $\tilde{R}_h$ in Algorithm 1. Call this collection $U^{(s,a)}$.

Remark 3. Since the columns of $I_h$ have exactly $Np$ entries, the permutation invariance proved in the above claim implies that, conditioned on $G_h$, each $I^g_h$ is uniformly distributed over the $Np$-sized subsets of $[N]\times\{g\}$.

For any set $J \subseteq [N]\times G^\complement_h$, define the count function $(N^J_g)_{g\in G^\complement_h}$ such that $N^J_g = |\{i \in [N] : (i,g) \in J\}|$. We are now ready to give the coupling: given $G_h$, draw uniformly random, independent permutations $\sigma_g$ for $g \in G^\complement_h$. Draw $(N_g)_{g\in G^\complement_h}$ independent of the $\sigma_g$'s and with the joint law of $(N^{J_h}_g)_{g\in G^\complement_h}$. Define:
$$\bar{J}_h = \{(\sigma_g(i), g) : i \le N_g,\ g \in G^\complement_h\}, \qquad \tilde{I}_h = \{(\sigma_g(i), g) : i \le Np,\ g \in G^\complement_h\}.$$

Claim 3. The marginal distributions of $\bar{J}_h$ and $\tilde{I}_h$ are, respectively, the distributions of $J_h$ and $I_h$.

Proof.
First we will prove a general statement about $J \sim \mathrm{Unif}(r, [N], [M])$. Let $X \in \{0,1\}^{N\times M}$ with $X_{i,m} = 1$ iff $(i,m) \in J$. Let $(N_m)_{m\in[M]}$ be the count functions corresponding to $J$, i.e., $N_m = \sum_i X_{i,m}$, and let $Y_m = (X_{1,m}, \dots, X_{N,m})^\top$. We will argue that, conditional on $\{N_m : m \in [M]\}$, the random vectors $Y_m$ are jointly independent. Indeed, pick any $x \in \{0,1\}^{N\times M}$ and $(n_1, \dots, n_M)$, and let $y_m$ be the $m$'th column of $x$. Then
$$\mathbb{P}\left[X = x, \cap_m\{N_m = n_m\}\right] = \prod_m \mathbf{1}\left(\sum_i x_{i,m} = n_m\right)\mathbf{1}\left(\sum_{i,m} x_{i,m} = r\right)\binom{MN}{r}^{-1}.$$
The above can also be written as
$$\mathbb{P}\left[\cap_m\{Y_m = y_m\}, \cap_m\{N_m = n_m\}\right] = \prod_m \mathbf{1}\left(\mathbf{1}^\top y_m = n_m\right)\mathbf{1}\left(\sum_m n_m = r\right)\binom{MN}{r}^{-1},$$
where $\mathbf{1}$ denotes the all-ones vector in $\mathbb{R}^N$. Marginalizing the above, we see that
$$\mathbb{P}\left[\cap_m\{N_m = n_m\}\right] = \prod_m \binom{N}{n_m}\mathbf{1}\left(\sum_m n_m = r\right)\binom{MN}{r}^{-1}.$$
Thus the conditional distribution can be expressed as
$$\mathbb{P}\left[\cap_m\{Y_m = y_m\} \,\middle|\, \cap_m\{N_m = n_m\}\right] = \prod_m \frac{\mathbf{1}(\mathbf{1}^\top y_m = n_m)}{\binom{N}{n_m}}$$
whenever $\sum_m n_m = r$. Since the conditional joint PMF factors, an easy calculation shows the conditional independence, i.e.,
$$\mathbb{P}\left[\cap_m\{Y_m = y_m\} \,\middle|\, \cap_m\{N_m = n_m\}\right] = \prod_m \mathbb{P}\left[Y_m = y_m \,\middle|\, \cap_m\{N_m = n_m\}\right].$$
Furthermore, for any $n_1, \dots, n_M$ such that $\sum_m n_m = r$, marginalization shows
$$\mathbb{P}\left[Y_m = y_m \,\middle|\, \cap_m\{N_m = n_m\}\right] = \frac{\mathbf{1}(\mathbf{1}^\top y_m = n_m)}{\binom{N}{n_m}}.$$
Let $N_{-m} = (N_1, \dots, N_{m-1}, N_{m+1}, \dots, N_M)$, and similarly for $n_{-m}$. Then $\mathbb{P}[Y_m = y_m, N_{-m} = n_{-m} \mid N_m = n_m]$ equals $0$ when $\sum_m n_m \ne r$, and otherwise equals
$$\frac{\mathbf{1}(\mathbf{1}^\top y_m = n_m)}{\binom{N}{n_m}}\,\mathbb{P}\left[N_{-m} = n_{-m} \mid N_m = n_m\right].$$
This factorization directly implies that $Y_m$, conditioned on $N_m$, is uniformly distributed on its support $\{y : \mathbf{1}^\top y = N_m\}$ and is independent of $N_{-m}$.

Observation: the above calculations give another way to generate $Y$: first generate $N_1, \dots, N_M$ from the right distribution, and then, conditioned on $N_m$, generate each $Y_m$ uniformly such that $\mathbf{1}^\top Y_m = N_m$.
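The factorization argument above can be verified exactly by enumeration in a small case ($N = 3$, $M = 2$, $r = 2$; the sizes are our own choice):

```python
from itertools import combinations
from math import comb
from collections import Counter

N, M, r = 3, 2, 2
cells = [(i, m) for i in range(N) for m in range(M)]
subsets = list(combinations(cells, r))   # all outcomes of Unif(r, [N], [M])

# joint distribution of the column indicator vectors (Y_1, ..., Y_M)
joint = Counter()
for J in subsets:
    cols = tuple(tuple(1 if (i, m) in J else 0 for i in range(N))
                 for m in range(M))
    joint[cols] += 1
total = len(subsets)

# verify: conditional on the counts (n_1, ..., n_M), the columns are
# independent and each Y_m is uniform over {y : 1^T y = n_m}
for cols, cnt in joint.items():
    ns = [sum(y) for y in cols]
    p_joint = cnt / total
    p_counts = sum(c for cc, c in joint.items()
                   if [sum(y) for y in cc] == ns) / total
    cond = p_joint / p_counts
    expected = 1.0
    for n in ns:
        expected /= comb(N, n)           # prod_m 1 / C(N, n_m)
    assert abs(cond - expected) < 1e-12
print("conditional factorization verified")
```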
Next we apply the above calculations and observation to $J = J_h | G_h \sim \mathrm{Unif}(\frac{Np|G^\complement_h|}{2}, [N], G^\complement_h)$. For a uniformly random permutation $\sigma$ on $[N]$, the set $\{\sigma(i) : 1 \le i \le k\}$ is uniformly distributed over all $k$-sized subsets of $[N]$. In the statement of the claim, the permutations are chosen independently for each $g \in G^\complement_h$. Thus, from the above observation, we have $J_h \stackrel{d}{=} \bar{J}_h$ conditioned on $G_h$. The claim about $\tilde{I}_h$ follows directly from the permutation invariance proved in Claim 1.

Claim 4. $\mathbb{P}(N_g > Np \mid G_h) \le \exp(-c_0 Np)$ for every $g \in G^\complement_h$.

Proof. Throughout this proof, we will condition on the terminal active set $G_h$. We will show this using the concentration results under the negative regression property, as established in Proposition 29 of Dubhashi & Ranjan (1996). Write $N_g = \sum_{i=1}^N \mathbf{1}((i,g) \in \bar{J}_h)$. We now show that the collection $X_{ig} := \mathbf{1}((i,g) \in \bar{J}_h)$ for $i \in [N], g \in G^\complement_h$ satisfies the negative regression property; by the definition of negative regression, the sub-collection $(X_{ig})_{i\in[N]}$ then also satisfies this property for every $g \in G^\complement_h$. Consider the partial order over binary vectors $X \succeq Y$ iff $X_l \ge Y_l$ for every $l$. The negative regression property is satisfied iff for every $K_1, K_2 \subseteq [N]\times G^\complement_h$ with $K_1 \cap K_2 = \emptyset$ and every real-valued function $f(X_m : m \in K_1)$ which is non-decreasing with respect to the partial order, the function
$$g(t_l : l \in K_2) := \mathbb{E}\left[f(X_m : m \in K_1) \,\middle|\, X_l = t_l\ \forall l \in K_2\right]$$
is non-increasing in the $t_l$ with respect to the partial order. Note that in the case of the uniform distribution, as for $\bar{J}_h$, the conditional distribution of $(X_m)_{m\in K_1}$ is the uniform, permutation-invariant distribution with constant sum almost surely, the sum being $\frac{Np|G^\complement_h|}{2} - \sum_{l\in K_2} t_l$.
Therefore, whenever $t'_l \ge t_l$ for every $l \in K_2$, we have the following stochastic dominance:
$$\left((X_m)_{m\in K_1} \,\middle|\, X_l = t'_l\ \forall l \in K_2\right) \preceq \left((X_m)_{m\in K_1} \,\middle|\, X_l = t_l\ \forall l \in K_2\right).$$
This coupling leads us to conclude that
$$g(t_l : l \in K_2) = \mathbb{E}\left[f(X_m : m \in K_1) \,\middle|\, X_l = t_l\ \forall l \in K_2\right] \ge \mathbb{E}\left[f(X_m : m \in K_1) \,\middle|\, X_l = t'_l\ \forall l \in K_2\right] = g(t'_l : l \in K_2),$$
where the inequality follows from stochastic dominance. This implies that the function $g$ is non-increasing, which establishes the negative regression property. Now, we consult Proposition 29 in Dubhashi & Ranjan (1996) to conclude that we can take Chernoff bounds on $N_g = \sum_{i\in[N]} X_{ig}$ as though the $X_{ig}$ were i.i.d. $\mathrm{Ber}(p)$. Therefore, an application of Bernstein's inequality (Boucheron et al., 2013) concludes the statement of the claim.

Now, $J_h \subseteq I_h$ if and only if $N_g \le Np$ for every $g \in G^\complement_h$. Therefore, from the claim above, we have $\mathbb{P}(J_h \subseteq I_h \mid G_h) \ge 1 - |S||A|\exp(-c_0 Np)$. We are now ready to prove Theorem 1.

Proof of Theorem 1. In order to establish the result, we need to show that, with $p$ as set in the statement, the algorithm returns $\epsilon$-optimal policies $\hat\Pi_u$ for every user $u \in [N]$ with probability at least $1-\delta$. The total sample complexity is the number of trajectories queried in Phase 1 plus the number queried in Phase 2. Phase 1 queries $K_{rf}(\frac{\epsilon}{8}, \frac{\delta}{2})$ trajectories, which is $\frac{C|S||A|H^2(|S| + \log(1/\delta))}{\epsilon^2}\,\mathrm{polylog}(\frac{|S||A|H}{\epsilon})$ by the results of Zhang et al. (2020). By Lemma 5, the sample complexity of Phase 2 is $\frac{C|S||A|NHp}{\epsilon}$, and with the value of $p$ given in the statement of the theorem, this succeeds with probability at least $1 - \frac{\delta}{4}$ conditioned on the success of Phase 1. We will show that, conditioned on the success of Phase 2, with probability at least $1 - \frac{\delta}{4}$, the nuclear norm minimization algorithm of Recht (2011) successfully recovers $R^{G^\complement_h}_h$.
Indeed, by Theorem 1 in Recht (2011), whenever the coordinates of $R^{G^\complement_h}_h$ corresponding to random indices drawn from $\mathrm{Unif}(m, [N], G^\complement_h)$ are observed, with $m = C_1\max(\mu_1^2, \mu_0)\,r\,(N + |G^\complement_h|)\log^2|G^\complement_h|\log(\frac{H}{\delta})$, the algorithm succeeds at recovering $R^{G^\complement_h}_h$ with probability at least $1 - \frac{\delta}{8H}$. The number of coordinates we observe is
$$Np|G^\complement_h| \ge \frac{Np|S||A|}{2} \ge 2C_1\max(\mu_1^2, \mu_0)\,r\,(N + |G^\complement_h|)\log^2|G^\complement_h|\log\left(\frac{H}{\delta}\right),$$
where in the first step we have used Assumption 2 to conclude that $|G^\complement_h| \ge \frac{|S||A|}{2}$. For the constant $C$ in the definition of $p$ large enough, we must have $m \le \frac{Np|G^\complement_h|}{2}$. Note that the results of Recht (2011) require at least $m$ observations at uniformly random coordinates, but we do not obtain observations at uniformly random coordinates. Here, we use the results of Lemma 7. Let $J_h$ be a fictitious subset of coordinates distributed as $\mathrm{Unif}(m, [N], G^\complement_h)$ when conditioned on $G^\complement_h$. If the observed coordinates were $J_h$, we could successfully estimate the reward matrix $R_h$ with probability at least $1 - \frac{\delta}{8H}$. Now, suppose that the actually observed coordinates are $I_h$, a superset of $J_h$. One checks that the matrix completion algorithm, which is based on constrained nuclear-norm minimization, still succeeds with observed coordinates $I_h$ whenever it succeeds with observed coordinates $J_h$. We now refer to the coupling in Lemma 7, which shows that, when the constant $C_1$ in the definition of $p$ is large enough, we can couple $J_h$ to the true distribution of $I_h$ such that $J_h \subseteq I_h$ with probability at least $1 - \frac{\delta}{8H}$. Applying a union bound over $h \in [H]$, we conclude that Phase 3 succeeds with probability at least $1 - \frac{\delta}{4}$ conditioned on the success of Phases 1 and 2.
Therefore, from the arguments above, we conclude that Phases 1, 2, and 3 succeed with probability at least $1-\delta$ and give us the reward matrices $R^{G^\complement_h}_h$, where the sets satisfy the following guarantee from Lemma 5:
$$\sup_\pi \sum_{h=1}^H P^\pi_h(G_h) \le \frac{5\epsilon}{8}.$$
It now remains to show that we obtain $\epsilon$-optimal policies for each user after Phase 4. Note that whenever Phase 1 succeeds, we can compute $\epsilon/4$-optimal policies for every possible reward function bounded in $[0,1]$. Since we do not know the rewards over the set $G_h$, we set them to zero, as described in the algorithm, to obtain $\hat{R}_h$. It remains to show that planning with $\hat{R}_h$ and using it with the reward-free RL algorithm gives us an $\epsilon$-optimal policy. Suppose $\Pi^*_u$ is the optimal policy for user $u$ and let $\hat\Pi_u$ be the optimal policy for user $u$ under the rewards $\hat{R}_h(u, (s,a))$. Note that, combined with the guarantees for reward-free RL, in order to complete the proof of the theorem it is sufficient to show that the policy $\hat\Pi_u$ is $3\epsilon/4$-optimal with respect to the actual rewards $R_h(u, (s,a))$. Let $S^*_{1:H}, A^*_{1:H} \sim M(\Pi^*_u)$ and $\bar{S}_{1:H}, \bar{A}_{1:H} \sim M(\hat\Pi_u)$. Then
$$\mathbb{E}\sum_{h=1}^H R_h(u, (S^*_h, A^*_h)) \le \mathbb{E}\sum_{h=1}^H \left[R_h(u, (S^*_h, A^*_h))\mathbf{1}((S^*_h, A^*_h) \in G^\complement_h) + \mathbf{1}((S^*_h, A^*_h) \in G_h)\right]$$
$$= \mathbb{E}\sum_{h=1}^H \hat{R}_h(u, (S^*_h, A^*_h)) + \sum_{h=1}^H P^{\Pi^*_u}_h(G_h) \le \mathbb{E}\sum_{h=1}^H \hat{R}_h(u, (S^*_h, A^*_h)) + \frac{5\epsilon}{8}$$
$$\le \mathbb{E}\sum_{h=1}^H \hat{R}_h(u, (\bar{S}_h, \bar{A}_h)) + \frac{5\epsilon}{8} \le \mathbb{E}\sum_{h=1}^H R_h(u, (\bar{S}_h, \bar{A}_h)) + \frac{5\epsilon}{8}.$$
In the first step we have used the fact that the rewards are uniformly bounded in $[0,1]$. In the second step we have used the definition $\hat{R}_h(u, (s,a)) := R_h(u, (s,a))\mathbf{1}((s,a) \in G^\complement_h)$. In the third step we have used the guarantee in equation 11. In the fourth step we have used the fact that $\hat\Pi_u$ maximizes the reward $\hat{R}_h$, and in the fifth step the fact that $R_h(u, (s,a)) \ge \hat{R}_h(u, (s,a))$ uniformly. From the discussion above, this concludes the proof of the theorem.
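Phase 3 solves the nuclear-norm minimization program of Recht (2011), which requires a convex solver. As a lightweight stand-in, the sketch below uses an iterative rank-projection ("hard impute") heuristic, which recovers an exactly low-rank matrix from uniformly sampled entries in benign cases; all names, sizes, and the choice of heuristic are ours, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_sa, r = 30, 20, 1       # users x (state, action) pairs, rank

# ground-truth low-rank reward matrix R = U V^T
U = rng.normal(size=(n_users, r))
V = rng.normal(size=(n_sa, r))
R = U @ V.T

mask = rng.random(R.shape) < 0.6   # observed entries, roughly uniform indices

def complete(R_obs, mask, rank, iters=1000):
    """Alternate between the best rank-`rank` approximation and
    re-imposing the observed entries (a simple hard-impute heuristic)."""
    X = np.where(mask, R_obs, 0.0)
    low = X
    for _ in range(iters):
        Uk, s, Vk = np.linalg.svd(X, full_matrices=False)
        low = (Uk[:, :rank] * s[:rank]) @ Vk[:rank]   # rank-r projection
        X = np.where(mask, R_obs, low)                # keep observations fixed
    return low

R_hat = complete(R, mask, r)
err = np.abs(R_hat - R).max()
print("max entrywise error:", err)
```

This heuristic has no recovery guarantee of its own; the theorem above relies on the nuclear-norm program and the incoherence conditions of Recht (2011).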

C ANALYSIS -LINEAR MDPS

Lemma 8. Suppose Assumption 2 holds. Let $\kappa > 1$ and $T \ge \frac{Cd\kappa^2}{(\gamma - \epsilon)^2}\log\frac{d\kappa}{\gamma - \epsilon}$. With probability at least $1 - H\exp(-c(\gamma-\epsilon)T)$, Algorithm 2 returns $\phi_{th}$ such that $\sum_{t=1}^T \phi_{th}\phi_{th}^\top \succeq \kappa^2 I$ for every $h \in [H]$.

Proof of Theorem 2. By Theorem 1 in Wagenmaker et al. (2022), we take $K_{rf}(\epsilon, \delta/4) = \frac{CdH^5(d + \log(1/\delta))}{\epsilon^2} + \frac{Cd^{9/2}H^6\log^4(1/\delta)}{\epsilon}$. Phase 1 succeeds with probability $1 - \frac{\delta}{4}$; note that this is the quantity $T_{rf}$ in the statement of the theorem. We now condition on the success of Phase 1. The number of trajectories queried by Algorithm 2 is $HT = T_{pol}$. By Lemma 8, we conclude that, for the given values of $T$ and $\kappa$, this algorithm successfully outputs $\phi_{ht}$ such that $G_{\phi,h} \succeq \kappa^2 I$ for every $h \in [H]$, with probability at least $1 - \frac{\delta}{4}$. Now, condition on the success of Algorithm 2. By Theorem 3, we conclude that, with probability at least $1 - \frac{\delta}{4}$ and with the values of the given parameters, for every $h \in [H]$ the procedure in Step 2 of Phase 2 outputs a policy $\hat\Pi_{f,h}$ such that whenever $S_{1:H}, A_{1:H} \sim M(\hat\Pi_{f,h})$, the conditions in equation 2 are satisfied for the random vector $\psi(S_h, A_h)$ with $\zeta$ replaced by $\zeta/2$. We then use the active-learning-based matrix completion procedure given in Section 6, where the vectors $\psi_{jk}$ are sampled using the policy $\hat\Pi_{f,h}$ on the given user. By Theorem 4, we conclude that, conditioned on the success of all the steps above, with probability $1 - \frac{\delta}{4}$ we can exactly estimate each of the matrices $\Theta^*_h$ for $h \in [H]$ with $T_{\text{mat-comp}}$ samples. Upon the success of Phases 1, 2, and 3 (which occurs with probability at least $1-\delta$ by the union bound), we conclude that Phase 4 gives the $\epsilon$-optimal policy for each user $u \in [N]$ because of the guarantees of reward-free RL.
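Lemma 8 asserts that the exploratory phase accumulates a well-conditioned feature covariance. As a rough numerical illustration (our own construction, not Algorithm 2 itself): when the features seen during exploration are roughly isotropic, the minimum eigenvalue of $\sum_t \phi_t\phi_t^\top$ grows linearly in $T$ and quickly exceeds any fixed $\kappa^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, kappa = 5, 500, 2.0

# i.i.d. feature vectors, uniform on the unit sphere: an isotropic
# stand-in for the exploratory distribution of phi(S_h, A_h),
# for which E[phi phi^T] = I/d
phi = rng.normal(size=(T, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)

G = phi.T @ phi                      # sum_t phi_t phi_t^T
lam_min = np.linalg.eigvalsh(G)[0]   # concentrates near T/d = 100
print(lam_min, kappa**2)
```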

D DEFERRED PROOFS D.1 PROOF OF LEMMA 5

Proof. We suppose that the reward-free RL in Phase 1 succeeds and returns an $\frac{\epsilon}{8}$-optimal policy for every choice of rewards bounded in $[0,1]$. The algorithm terminates whenever the active sets are such that $\hat{V}(J(\cdot; G)) \le \frac{\epsilon}{2}$. Note that, by the definition of $J(\cdot; G)$, the maximum value for the MDP with reward $J(\cdot; G)$ is $\sup_\Pi \sum_{h=1}^H P^\Pi_h(G_h)$. Since $\hat{V}$ is the output of the reward-free RL algorithm, we conclude that
$$\left|\hat{V}(J(\cdot; G)) - \sup_\Pi \sum_{h=1}^H P^\Pi_h(G_h)\right| \le \frac{\epsilon}{8}.$$
We conclude via equation 13 and equation 14 that equation 9 holds, which establishes the second part of the theorem. We now consider the termination time. Suppose $G^{(t)}$ is the sequence of active sets before termination at step $t$ (i.e., it satisfies $\hat{V}(J(\cdot; G^{(t)})) > \frac{\epsilon}{2}$). Recall $\hat\Pi$, the output of the reward-free RL algorithm. It follows from the guarantees for reward-free RL that
$$\left|\sum_{h=1}^H P^{\hat\Pi_G}_h(G^{(t)}_h) - \sup_\Pi \sum_{h=1}^H P^\Pi_h(G^{(t)}_h)\right| \le \frac{\epsilon}{8}.$$
Combining this with equation 14 and the fact that $\hat{V}(J(\cdot; G^{(t)})) > \frac{\epsilon}{2}$, we conclude:
$$\sum_{h=1}^H P^{\hat\Pi_G}_h(G^{(t)}_h) \ge \frac{\epsilon}{4}.$$
We consider the potential function with $\varphi(0) = 0$ and $\varphi(t) = \sum_{h\in[H]}\sum_{(s,a)\in S\times A} T^{(t)}_{h,(s,a)}$, where $T^{(t)}_{h,(s,a)}$ is the counter $T_{h,(s,a)}$ inside Algorithm 1 at the beginning of step $t$. Whenever $G^{(t)}$ is such that $\hat{V}(J(\cdot; G^{(t)})) > \frac{\epsilon}{2}$ (i.e., before termination), we define $N_t := \varphi(t+1) - \varphi(t)$. Purely for the sake of the theoretical argument, we define fictitious i.i.d. random variables $N_t \sim \mathrm{Ber}(\frac{\epsilon}{8})$ after termination. Let $F_t = \sigma(G^{(s)}, S_{1:H}, A_{1:H}, R_{1:H}, U^{(s)} : s \le t)$.

Claim 5. The following relations hold:
1. $\mathbb{E}[N_t | F_t] \ge \frac{\epsilon}{8}$;
2. $\mathbb{E}[N_t^2 | F_t] \le H\,\mathbb{E}[N_t | F_t]$;
3. $|N_t| \le H$ almost surely.

Proof. The inequalities are clear when $G^{(t)}$ is such that $\hat{V}(J(\cdot; G^{(t)})) \le \frac{\epsilon}{2}$. Now consider the case $\hat{V}(J(\cdot; G^{(t)})) > \frac{\epsilon}{2}$.
By definition, conditioned on this event, we have almost surely
$$N_t = \sum_{h=1}^H \mathbf{1}\left((S^{(t)}_h, A^{(t)}_h) \in G^{(t)}_h\right)\mathbf{1}\left(\tilde{R}^{(t)}_h(U_t, (S^{(t)}_h, A^{(t)}_h)) = *\right).$$
That is, we increment $T_{h,(s,a)}$ only when we encounter an element of the active set such that the entry for this user has not been observed before. Observe that for any arbitrary $(s,a) \in S\times A$,
$$\mathbb{P}\left(\tilde{R}^{(t)}_h(U_t, (s,a)) = * \,\middle|\, F_t, (S^{(t)}_h, A^{(t)}_h) = (s,a)\right) = \frac{|\{u : \tilde{R}^{(t)}_h(u, (s,a)) = *\}|}{N}.$$
This is true since the law of $(S^{(t)}_h, A^{(t)}_h)$ is independent of $U_t$ (since all users share the same MDP), when conditioned on $F_t$. Now, the algorithm only fills the column corresponding to $(s,a)$ while the number of entries is smaller than $Np \le \frac{N}{2}$. We conclude that $|\{u : \tilde{R}^{(t)}_h(u, (s,a)) = *\}| \ge N - Np \ge \frac{N}{2}$. This allows us to conclude $\mathbb{P}(\tilde{R}^{(t)}_h(U_t, (s,a)) = * \mid F_t, (S^{(t)}_h, A^{(t)}_h) = (s,a)) \ge \frac{1}{2}$ and hence
$$\mathbb{E}[N_t | F_t] = \sum_{h=1}^H \mathbb{E}\left[\mathbf{1}((S^{(t)}_h, A^{(t)}_h) \in G^{(t)}_h)\mathbf{1}(\tilde{R}^{(t)}_h(U_t, (S^{(t)}_h, A^{(t)}_h)) = *)\,\middle|\,F_t\right] \ge \frac{1}{2}\sum_{h=1}^H P^{\hat\Pi_G}_h(G^{(t)}_h) \ge \frac{\epsilon}{8},$$
where in the last step we have used equation 15. The bound $|N_t| \le H$ almost surely follows from the definition, and we note that $\mathbb{E}[N_t^2 | F_t] \le H\,\mathbb{E}[N_t | F_t]$.

Claim 6. For any $\tau \in \mathbb{N}$ and some $c_0 > 0$ small enough, we have:
$$\mathbb{P}\left(\sum_{t=0}^{\tau-1} N_t < \frac{\epsilon\tau}{16}\right) \le \exp\left(-\frac{c_0\epsilon\tau}{H}\right).$$
Proof. For $0 < \lambda < \frac{3}{4H}$, consider $M_t = -\frac{\lambda^2\mathbb{E}[N_t^2|F_t]}{1 - \lambda H/3} + \lambda\left(\mathbb{E}[N_t|F_t] - N_t\right)$. Now consider:
$$\mathbb{E}\exp\left(\sum_{t=0}^{\tau-1} M_t\right) = \mathbb{E}\left[\mathbb{E}\left[\exp(M_{\tau-1})\mid F_{\tau-1}\right]\exp\left(\sum_{t=0}^{\tau-2} M_t\right)\right]$$
$$= \mathbb{E}\left[\mathbb{E}\left[\exp(\lambda\mathbb{E}[N_{\tau-1}|F_{\tau-1}] - \lambda N_{\tau-1})\mid F_{\tau-1}\right]\exp\left(\sum_{t=0}^{\tau-2} M_t\right)\exp\left(-\frac{\lambda^2\mathbb{E}[N^2_{\tau-1}|F_{\tau-1}]}{1-\lambda H/3}\right)\right]$$
$$\le \mathbb{E}\left[\exp\left(\frac{\lambda^2\mathbb{E}[N^2_{\tau-1}|F_{\tau-1}]}{1-\lambda H/3}\right)\exp\left(\sum_{t=0}^{\tau-2} M_t\right)\exp\left(-\frac{\lambda^2\mathbb{E}[N^2_{\tau-1}|F_{\tau-1}]}{1-\lambda H/3}\right)\right] = \mathbb{E}\exp\left(\sum_{t=0}^{\tau-2} M_t\right).$$
In the first step we have used the fact that $\sum_{t=0}^{\tau-2} M_t$ is $F_{\tau-1}$-measurable and the towering property of conditional expectation.
In the third step, we have used the exponential moment bound given in Exercise 2.8.5 in Vershynin (2018), applied to $N_{\tau-1} - \mathbb{E}[N_{\tau-1}|F_{\tau-1}]$, along with the fact that $N_t \in [0, H]$ almost surely. From equation 18, we conclude that $\mathbb{E}\exp(\sum_{t=0}^{\tau-1} M_t) \le 1$, and thus, applying the Chernoff bound, for any $\beta > 0$:
$$\mathbb{P}\left(\sum_{t=0}^{\tau-1}\left[-\frac{\lambda\,\mathbb{E}[N_t^2|F_t]}{1-\lambda H/3} + \mathbb{E}[N_t|F_t] - N_t\right] > \beta\right) \le \exp(-\lambda\beta).$$
Now, using item 2 from Claim 5, we conclude that
$$\mathbb{P}\left(\sum_{t=0}^{\tau-1} N_t < -\beta + \frac{3 - 4\lambda H}{3 - \lambda H}\sum_{t=0}^{\tau-1}\mathbb{E}[N_t|F_t]\right) \le \exp(-\lambda\beta).$$
Now, using item 1 from Claim 5, we note that $\sum_{t=0}^{\tau-1}\mathbb{E}[N_t|F_t] \ge \frac{\epsilon\tau}{8}$ almost surely. Setting $\lambda = \frac{1}{4H}$ and $\beta = c_0\epsilon\tau$ for some small enough constant $c_0$, we conclude:
$$\mathbb{P}\left(\sum_{t=0}^{\tau-1} N_t < \frac{\epsilon\tau}{16}\right) \le \exp\left(-\frac{c_0\epsilon\tau}{H}\right).$$
Let $\tau_{\text{term}}$ denote the termination time of the algorithm. Since $\varphi(t)$ is increasing in $t$, $\varphi(t) \le NpH|S||A|$, and the inequality is strict when $t < \tau_{\text{term}}$, for every $\tau < \tau_{\text{term}}$ we have $\varphi(\tau) = \sum_{t=0}^{\tau-1} N_t < NpH|S||A|$. Therefore, we have the following relationship between the events:
$$\{\tau_{\text{term}} > \tau\} \subseteq \left\{\sum_{t=0}^{\tau-1} N_t < Np|S||A|H\right\}.$$
Setting $\tau = \frac{16Np|S||A|H}{\epsilon}$, we have:
$$\mathbb{P}(\tau_{\text{term}} > \tau) \le \mathbb{P}\left(\sum_{t=0}^{\tau-1} N_t < Np|S||A|H\right) \le \exp(-cNp|S||A|H).$$

D.2 PROOF OF LEMMA 8

Let $B_{th}$ be the matrix $I + A_\phi$ in Algorithm 2 at step $t$ for horizon $h$, and let the corresponding projection $Q$ be $Q_{th}$; recall that $Q_{th}$ is the projection onto an eigenspace of $B_{th}$. Now, suppose $S_{1:H}, A_{1:H} \sim M_{U_t}(\hat\Pi_{Q_{t,h}})$ as in the algorithm, and let $\phi_{th} := \phi(S_h, A_h)$. Now, if $Q_{th} \ne 0$, then:
$$\phi_{th}^\top B_{th}^{-1}\phi_{th} \ge \phi_{th}^\top Q_{th} B_{th}^{-1} Q_{th}\phi_{th} \ge \phi_{th}^\top Q_{th}\frac{I}{1+\kappa^2}Q_{th}\phi_{th} = \frac{\|Q_{th}\phi_{th}\|^2}{1+\kappa^2}.$$
In the first step, we have used the fact that $Q_{th}$ is the projector onto an eigenspace of $B_{th}^{-1}$. In the second step, we have used the fact that over the eigenspace corresponding to $Q_{th}$, the eigenvalues of $B_{th}^{-1}$ are at least $\frac{1}{1+\kappa^2}$.
We now invoke Assumption 2, along with the guarantees of reward-free RL in Phase 1, to conclude that
$$\mathbb{E}\left[\|Q_{th}\phi_{th}\|^2 \mid Q_{th} \ne 0, B_{th}\right] \ge \gamma - \epsilon.$$
Now, by the fact that $Q_{th}$ is a projector and that $\|\phi_{th}\| \le 1$, we have
$$\mathbb{E}\left[\|Q_{th}\phi_{th}\|^4 \mid Q_{th}\ne 0, B_{th}\right] \le \mathbb{E}\left[\|Q_{th}\phi_{th}\|^2 \mid Q_{th}\ne 0, B_{th}\right].$$
Recall the Paley-Zygmund inequality, which states that for any positive random variable $Z$, we must have $\mathbb{P}(Z > \frac{\mathbb{E}Z}{2}) \ge \frac{1}{4}\frac{(\mathbb{E}Z)^2}{\mathbb{E}Z^2}$. Therefore,
$$\mathbb{P}\left(\phi_{th}^\top B_{th}^{-1}\phi_{th} > \frac{\gamma-\epsilon}{2(1+\kappa^2)}\,\middle|\,Q_{th}\ne 0, B_{th}\right) \ge \mathbb{P}\left(\|Q_{th}\phi_{th}\|^2 > \frac{\gamma-\epsilon}{2}\,\middle|\,Q_{th}\ne 0, B_{th}\right)$$
$$\ge \mathbb{P}\left(\|Q_{th}\phi_{th}\|^2 > \frac{1}{2}\mathbb{E}\left[\|Q_{th}\phi_{th}\|^2\mid Q_{th}\ne 0, B_{th}\right]\,\middle|\,Q_{th}\ne 0, B_{th}\right)$$
$$\ge \frac{1}{4}\frac{\mathbb{E}\left[\|Q_{th}\phi_{th}\|^2\mid Q_{th}\ne 0, B_{th}\right]^2}{\mathbb{E}\left[\|Q_{th}\phi_{th}\|^4\mid Q_{th}\ne 0, B_{th}\right]} \ge \frac{1}{4}\mathbb{E}\left[\|Q_{th}\phi_{th}\|^2\mid Q_{th}\ne 0, B_{th}\right] \ge \frac{\gamma-\epsilon}{4}.$$
In the first step, we have used equation 19. In the second step, we have used equation 20. In the third step, we have used the Paley-Zygmund inequality and the moment bound in equation 21.

Define the stopping time $\tau = \inf\{t \le T : Q_{th} = 0\}$, with $\tau = \infty$ if the set on the right-hand side is empty. Let $\Xi^0_t$, for $t \in \{0\}\cup\mathbb{N}$, be a sequence of i.i.d. random variables with the law $\frac{\gamma-\epsilon}{2(1+\kappa^2)}\mathrm{Ber}(\frac{\gamma-\epsilon}{4})$. We consider the sequence of random variables $\Xi_t = \phi_{th}^\top B_{th}^{-1}\phi_{th}$ for $t < \tau$ and $\Xi_t = \Xi^0_t$ for $t \ge \tau$.

Now, we apply the matrix determinant lemma, which states that $\det(B + uu^\top) = \det(B)(1 + u^\top B^{-1}u)$. We note that $B_{(t+1)h} = B_{th} + \phi_{th}\phi_{th}^\top$. Therefore, whenever $t < \tau$, we must have
$$\det(B_{(t+1)h}) = \det(B_{th})(1 + \Xi_t).$$
Since $\|\phi_{th}\| \le 1$ almost surely, we must have $\mathrm{Tr}(B_{th}) = \sum_{i=1}^d \langle e_i, B_{th}e_i\rangle \le d + t$. It is easy to show that, for any PSD matrix $A$, if $\mathrm{Tr}(A) \le \alpha$ then $\det(A) \le (\frac{\alpha}{d})^d$ (since the trace is the sum of the eigenvalues and the determinant is their product). Combining the equations above, we conclude that whenever $t < \tau$, we must have
$$\left(\frac{t+1+d}{d}\right)^d \ge \prod_{s=0}^{t}(1 + \Xi_s).$$
Therefore, the event
$$\{\tau > T\} \subseteq \left\{\left(\frac{T+1+d}{d}\right)^d \ge \prod_{s=0}^{T}(1+\Xi_s)\right\}. \qquad (24)$$
Claim 7.
$$\mathbb{P}\left(\prod_{s=0}^{T}(1+\Xi_s) \ge \left(1 + \frac{\gamma-\epsilon}{2(1+\kappa^2)}\right)^{\frac{(\gamma-\epsilon)T}{8}}\right) \ge 1 - \exp(-c_0 T(\gamma-\epsilon)).$$
Moreover, for $\kappa > 1$ and $T \ge \frac{Cd\kappa^2}{(\gamma-\epsilon)^2}\log\frac{d\kappa}{\gamma-\epsilon}$, we have
$$\mathbb{P}\left(\prod_{s=0}^{T}(1+\Xi_s) \ge \left(\frac{T+1+d}{d}\right)^d\right) \ge 1 - \exp(-c_0 T(\gamma-\epsilon)).$$
Proof. Let $N_T$ be the number of variables $(\Xi_t)_{t=0}^T$ such that $\Xi_t \ge \frac{\gamma-\epsilon}{2(1+\kappa^2)}$. Then it is clear that $\prod_{s=0}^T(1+\Xi_s) \ge (1 + \frac{\gamma-\epsilon}{2(1+\kappa^2)})^{N_T}$. Therefore,
$$\mathbb{P}\left(\prod_{s=0}^{T}(1+\Xi_s) \ge \left(1+\frac{\gamma-\epsilon}{2(1+\kappa^2)}\right)^{\frac{(\gamma-\epsilon)T}{8}}\right) \ge \mathbb{P}\left(N_T \ge \frac{(\gamma-\epsilon)T}{8}\right) \ge \mathbb{P}\left(\mathrm{Bin}\left(T, \frac{\gamma-\epsilon}{4}\right) \ge \frac{(\gamma-\epsilon)T}{8}\right) \ge 1 - \exp(-c_0T(\gamma-\epsilon)).$$
Here $\mathrm{Bin}$ refers to the law of a binomial random variable. The first step follows from the product bound above. The second step follows from equation 22, which shows that, conditioned on $Q_{th}, B_{th}$, the random variable $\mathbf{1}(\Xi_t \ge \frac{\gamma-\epsilon}{2(1+\kappa^2)})$ stochastically dominates $\mathrm{Ber}(\frac{\gamma-\epsilon}{4})$. The last step follows from an application of Bernstein's inequality for binomial random variables. Now, using equation 24 along with Claim 7, we conclude:
$$\mathbb{P}(\tau > T) \le \mathbb{P}\left(\left(\frac{T+1+d}{d}\right)^d \ge \prod_{s=0}^{T}(1+\Xi_s)\right) \le \exp(-c_0T(\gamma-\epsilon)).$$

D.3 PROOF OF LEMMA 1

Proof. By the definition of a linear MDP, we must have $S_{h+1}|S_h, A_h \sim \sum_{i=1}^d \langle\phi(S_h, A_h), e_i\rangle\mu_{ih}(\cdot)$ and $A_{h+1}|S_{h+1} \sim \pi_{h+1}(\cdot|S_{h+1})$. Therefore, for any bounded, measurable function $g : S\times A \to \mathbb{R}$, we must have:
$$\mathbb{E}\,g(S_{h+1}, A_{h+1}) = \mathbb{E}\left[\mathbb{E}\left[g(S_{h+1}, A_{h+1})\mid S_h, A_h\right]\right] = \mathbb{E}\sum_{i=1}^d \langle\phi(S_h, A_h), e_i\rangle\int\mu_{ih}(ds)\int\pi_{h+1}(da|s)g(s,a) = \sum_{i=1}^d \nu_{ih}\int\mu_{ih}(ds)\int\pi_{h+1}(da|s)g(s,a).$$

D.4 PROOF OF LEMMA 9

Proof. It is clear from the assumption that
$$\mathbb{E}\left[\int g(s_{(h+1)t}, a)\pi_{h+1}(da|s_{(h+1)t})\,\middle|\,(\phi_{ht})_{t\le T}\right] = \sum_{i=1}^d \langle\phi_{ht}, e_i\rangle\int\mu_{ih}(ds)\int\pi_{h+1}(da|s)g(s,a).$$
Note that
$$\sum_{t=1}^T \alpha_{ht,\nu}\phi_{ht} = \sum_{t=1}^T \left(\phi_{ht}^\top G_{\phi,h}^{-1}\nu\right)\phi_{ht} = \left(\sum_{t=1}^T \phi_{ht}\phi_{ht}^\top\right)G_{\phi,h}^{-1}\nu = G_{\phi,h}G_{\phi,h}^{-1}\nu = \nu. \qquad (27)$$
Therefore,
$$\mathbb{E}\left[\hat{T}(g; \nu, \pi_h)\,\middle|\,(\phi_t)_{t\in[T]}\right] = \sum_{i=1}^d \left\langle\sum_{t=1}^T \alpha_{t,\nu}\phi_t, e_i\right\rangle\int\mu_{i(h-1)}(ds)\int\pi_h(da|s)g(s,a) = \sum_{i=1}^d \langle\nu, e_i\rangle\int\mu_{i(h-1)}(ds)\int\pi_h(da|s)g(s,a) = T(g; \nu, \pi_h).$$
Note that, conditioned on $(\phi_t)_{t\in[T]}$, the terms $\alpha_{ht,\nu}\int g(s_{(h+1)t}, a)\pi_{h+1}(da|s_{(h+1)t})$ are independent random variables bounded in magnitude by $\alpha_{ht,\nu}B$. Therefore, applying the Azuma-Hoeffding inequality, we conclude:
$$\mathbb{P}\left(|\hat{T}(g;\nu,\pi_h) - T(g;\nu,\pi_h)| > \beta\,\middle|\,(\phi_t)_{t\in[T]}\right) \le 2\exp\left(-\frac{\beta^2}{2B^2\sum_t \alpha_{t,\nu}^2}\right).$$
Now, observe that
$$\sum_t \alpha_{ht,\nu}^2 = \nu^\top G_{\phi,h}^{-1}\nu \le \frac{1}{\kappa^2} \quad\text{whenever } G_{\phi,h} \succeq \kappa^2 I.$$
This concludes the proof.
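The reweighting identity in equation 27 and the variance computation above can be checked directly (dimensions are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 50

phi = rng.normal(size=(T, d))            # feature vectors phi_{ht}, t = 1..T
G = phi.T @ phi                          # G_{phi,h} = sum_t phi_t phi_t^T
nu = rng.normal(size=d)

alpha = phi @ np.linalg.solve(G, nu)     # alpha_{ht,nu} = phi_t^T G^{-1} nu

# equation (27): sum_t alpha_t phi_t = nu
print(np.allclose(phi.T @ alpha, nu))
# variance term: sum_t alpha_t^2 = nu^T G^{-1} nu
print(np.isclose((alpha**2).sum(), nu @ np.linalg.solve(G, nu)))
```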

D.5 PROOF OF LEMMA 10

Proof. Notice that:

\[
\left|E_1^{\nu_1}(\Pi) - E_1^{\nu_1'}(\Pi')\right| \le \left|E_1^{\nu_1}(\Pi) - E_1^{\nu_1}(\Pi')\right| + \left|E_1^{\nu_1}(\Pi') - E_1^{\nu_1'}(\Pi')\right| \le \left|E_1^{\nu_1}(\Pi) - E_1^{\nu_1}(\Pi')\right| + \|\nu_1-\nu_1'\|_1
\]
\[
\le \left\|\mathbb{E}\int \phi(S_1,a)\,\pi_1(da\mid S_1) - \mathbb{E}\int \phi(S_1,a)\,\pi_1'(da\mid S_1)\right\|_1 + \|\nu_1-\nu_1'\|_1 \le \sup_{(s,a)}\|\phi(s,a)\|_1\, \mathrm{TV}(\pi_1,\pi_1') + \|\nu_1-\nu_1'\|_1 \le \mathrm{TV}(\pi_1,\pi_1') + \|\nu_1-\nu_1'\|_1 \quad (29)
\]

In the first, second and third steps we have used the triangle inequality. In the fourth step, we have used the fact that for any bounded function $f$ and any probability measures $\mu, \nu$, we have $|\int f(x)\mu(dx) - \int f(x)\nu(dx)| \le \sup_x |f(x)|\,\mathrm{TV}(\mu,\nu)$; the last step uses $\sup_{(s,a)}\|\phi(s,a)\|_1 \le 1$. Now consider:

\[
\left|\left\|T_j(\phi,\nu_{j-1},\pi_j)-\nu_j\right\|_1 - \left\|T_j(\phi,\nu_{j-1}',\pi_j')-\nu_j'\right\|_1\right| \le \|\nu_j-\nu_j'\|_1 + \left\|T_j(\phi,\nu_{j-1},\pi_j)-T_j(\phi,\nu_{j-1}',\pi_j')\right\|_1
\]
\[
\le \|\nu_j-\nu_j'\|_1 + \left\|T_j(\phi,\nu_{j-1},\pi_j)-T_j(\phi,\nu_{j-1}',\pi_j)\right\|_1 + \left\|T_j(\phi,\nu_{j-1}',\pi_j)-T_j(\phi,\nu_{j-1}',\pi_j')\right\|_1 \quad (30)
\]

Now, observe that:

\[
\left\|T_j(\phi,\nu_{j-1},\pi_j)-T_j(\phi,\nu_{j-1}',\pi_j)\right\|_1 \le \sum_{i=1}^{d}\left|\langle \nu_{j-1}-\nu_{j-1}', e_i\rangle\right|\,\left\|\int \phi(s,a)\,\mu_{i(j-1)}(ds)\,\pi_j(da\mid s)\right\|_1 \le \sum_{i=1}^{d}\left|\langle \nu_{j-1}-\nu_{j-1}', e_i\rangle\right| = \|\nu_{j-1}-\nu_{j-1}'\|_1 \quad (31)
\]

where we recall that $\sup_{i,h,\pi}\|\int \phi(s,a)\,\mu_{ih}(ds)\,\pi(da\mid s)\|_1 \le 1$, as given in the definition of a Linear MDP. Using the Hahn–Jordan decomposition of a signed measure, we conclude:

\[
\left\|T_j(\phi,\nu_{j-1}',\pi_j)-T_j(\phi,\nu_{j-1}',\pi_j')\right\|_1 \le \sum_{i=1}^{d}\left|\langle \nu_{j-1}', e_i\rangle\right|\,\left\|\int \phi(s,a)\,\mu_{i(j-1)}(ds)\,(\pi_j(da\mid s)-\pi_j'(da\mid s))\right\|_1 \le \sum_{i=1}^{d} C_\mu \left|\langle \nu_{j-1}', e_i\rangle\right| \mathrm{TV}(\pi_j,\pi_j') \le C_\mu \|\nu_{j-1}'\|_1\, \mathrm{TV}(\pi_j,\pi_j') \quad (32)
\]

Combining equation 30, equation 31 and equation 32, we conclude:

\[
\left|\left\|T_j(\phi,\nu_{j-1},\pi_j)-\nu_j\right\|_1 - \left\|T_j(\phi,\nu_{j-1}',\pi_j')-\nu_j'\right\|_1\right| \le \|\nu_j-\nu_j'\|_1 + C_\mu \|\nu_{j-1}'\|_1\,\mathrm{TV}(\pi_j,\pi_j') + \|\nu_{j-1}-\nu_{j-1}'\|_1 \quad (33)
\]

Combining equation 29 and equation 33, we conclude the first inequality in the statement of the lemma.
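The total-variation fact used above can be checked on a small discrete example. Here TV is taken as the $\ell_1$ distance between the probability vectors (one common convention; under the half-$\ell_1$ convention a factor 2 appears), and the support size and test function are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

# Two probability vectors on a common finite support (illustrative).
mu = rng.random(n); mu /= mu.sum()
nu = rng.random(n); nu /= nu.sum()
f = rng.uniform(-3, 3, size=n)       # bounded test function

lhs = abs(f @ mu - f @ nu)           # |int f dmu - int f dnu|
tv = np.abs(mu - nu).sum()           # TV taken as the l1 distance here
assert lhs <= np.abs(f).max() * tv + 1e-12
```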
With a reasoning very similar to that in equation 29, we have:

\[
\left|\hat{E}_1^{\nu_1}(\Pi) - \hat{E}_1^{\nu_1'}(\Pi')\right| \le \mathrm{TV}(\pi_1,\pi_1') + \|\nu_1-\nu_1'\|_1 \quad (34)
\]

Using similar reasoning as in equation 33:

\[
\left|\left\|\hat{T}_j(\phi,\nu_{j-1},\pi_j)-\nu_j\right\|_1 - \left\|\hat{T}_j(\phi,\nu_{j-1}',\pi_j')-\nu_j'\right\|_1\right| \le \|\nu_j-\nu_j'\|_1 + \sum_{t=1}^{T}\left|(\nu_{j-1}-\nu_{j-1}')^{\intercal} G_{\phi,j-1}^{-1}\phi_{(j-1)t}\right| + \left(\sum_{t=1}^{T}\left|(\nu_{j-1}')^{\intercal} G_{\phi,j-1}^{-1}\phi_{(j-1)t}\right|\right)\mathrm{TV}(\pi_j,\pi_j') \quad (35)
\]

Now note that for any $\nu \in \mathbb{R}^d$, we have:

\[
\sum_{t=1}^{T}\left|\nu^{\intercal} G_{\phi,j-1}^{-1}\phi_{(j-1)t}\right| \le \sqrt{T\sum_{t=1}^{T}\left(\nu^{\intercal} G_{\phi,j-1}^{-1}\phi_{(j-1)t}\right)^2} = \sqrt{T\sum_{t=1}^{T}\nu^{\intercal} G_{\phi,j-1}^{-1}\phi_{(j-1)t}\phi_{(j-1)t}^{\intercal} G_{\phi,j-1}^{-1}\nu} = \sqrt{T\,\nu^{\intercal} G_{\phi,j-1}^{-1}\nu} \le \sqrt{\frac{T}{\kappa^2}}\,\|\nu\|_2 \quad (36)
\]

Here, in the first step we have used the fact that $\|x\|_1 \le \sqrt{K}\|x\|_2$ whenever $x \in \mathbb{R}^K$. In the third step, we have used the fact that $\sum_{t=1}^{T}\phi_{(j-1)t}\phi_{(j-1)t}^{\intercal} = G_{\phi,j-1}$ by definition. In the last step, we have used the fact that $G_{\phi,j-1} \succeq \kappa^2 I$. Plugging this into equation 35, we conclude:

\[
\left|\left\|\hat{T}_j(\phi,\nu_{j-1},\pi_j)-\nu_j\right\|_1 - \left\|\hat{T}_j(\phi,\nu_{j-1}',\pi_j')-\nu_j'\right\|_1\right| \le \|\nu_j-\nu_j'\|_1 + \sqrt{\frac{T}{\kappa^2}}\,\|\nu_{j-1}-\nu_{j-1}'\|_2 + \sqrt{\frac{T}{\kappa^2}}\,\|\nu_{j-1}'\|_2\,\mathrm{TV}(\pi_j,\pi_j') \quad (37)
\]

Using this and the definition of $\hat{F}$, we conclude the second inequality in the statement of the lemma. Equation 44 and equation 45 follow from similar reasoning.
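Equation 36 is a deterministic chain of inequalities and can be verified numerically; the sizes and Gaussian features below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 60, 5                          # illustrative sizes (assumption)
Phi = rng.normal(size=(T, d))         # rows play the role of phi_{(j-1)t}
G = Phi.T @ Phi                       # G_{phi,j-1} = sum_t phi_t phi_t^T
nu = rng.normal(size=d)

weights = Phi @ np.linalg.solve(G, nu)      # nu^T G^{-1} phi_t, t = 1..T
lhs = np.abs(weights).sum()

# Cauchy-Schwarz plus the identity sum_t phi_t phi_t^T = G:
mid = np.sqrt(T * nu @ np.linalg.solve(G, nu))
kappa2 = np.linalg.eigvalsh(G).min()        # G >= kappa^2 I with this kappa^2
rhs = np.sqrt(T / kappa2) * np.linalg.norm(nu)

assert lhs <= mid + 1e-9                    # first two steps of equation 36
assert mid <= rhs + 1e-9                    # last step of equation 36
```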

D.6 PROOF OF LEMMA 11

Proof. First consider the case $h=1$, and let $g(s,a) := \phi(s,a)$. In this case, $\sup_{\nu\in B_d(1)} |\hat{E}_{\nu,1}(\Pi) - E_{\nu,1}(\Pi)| \le \|T_0(g;\pi_1) - \hat{T}_0(g;\pi_1)\|_1$. By equation 29 and equation 34, we conclude that $\pi_1 \mapsto T_0(\phi;\pi_1)$ and $\pi_1 \mapsto \hat{T}_0(\phi;\pi_1)$ are 1-Lipschitz with respect to $\mathrm{TV}(\cdot,\cdot)$ and $\|\cdot\|_1$. Therefore,

\[
\sup_{\Pi=\pi_1,\dots,\pi_H\in\mathcal{Q}} \|T_0(g;\pi_1)-\hat{T}_0(g;\pi_1)\|_1 \le \sup_{\Pi=\pi_1,\dots,\pi_H\in\hat{\mathcal{Q}}_\eta} \|T_0(g;\pi_1)-\hat{T}_0(g;\pi_1)\|_1 + 2\eta
\]

We apply Lemma 9 coordinate-wise to the coordinates of $\phi$ and union bound over $\hat{\mathcal{Q}}_\eta$. We have:

\[
P\left(\sup_{\Pi=\pi_1,\dots,\pi_H\in\mathcal{Q}} \|T_0(g;\pi_1)-\hat{T}_0(g;\pi_1)\|_1 > 2\eta + d\beta\right) \le d\,|\hat{\mathcal{Q}}_\eta| \exp\left(-\frac{\beta^2\kappa^2}{2}\right)
\]

For $h>1$:

\[
|E_{\nu,h}(\Pi)-\hat{E}_{\nu,h}(\Pi)| \le \sup_{\substack{\nu_1,\dots,\nu_h\in B_d(1)\\ \Pi\in\mathcal{Q}}} |F(\Pi,\nu_1,\dots,\nu_h)-\hat{F}(\Pi,\nu_1,\dots,\nu_h)| \le \sup_{\substack{\nu_1,\dots,\nu_h\in \hat{B}_{d,\eta}\\ \Pi\in\hat{\mathcal{Q}}_\eta}} |F(\Pi,\nu_1,\dots,\nu_h)-\hat{F}(\Pi,\nu_1,\dots,\nu_h)| + 2\left(1+C_\mu+\sqrt{\frac{T}{\kappa^2}}\right)\eta h \quad (39)
\]

Now, by the triangle inequality, we have:

\[
|F(\Pi,\nu_1,\dots,\nu_h)-\hat{F}(\Pi,\nu_1,\dots,\nu_h)| \le \|T_0(\phi,\pi_1)-\hat{T}_0(\phi,\pi_1)\|_1 + \sum_{j=1}^{h-1}\|T_j(\phi,\nu_j,\pi_{j+1})-\hat{T}_j(\phi,\nu_j,\pi_{j+1})\|_1 \quad (40)
\]

Therefore, by invoking Lemma 9, along with a union bound over every component of the sum in equation 40 and over the nets in equation 39, we conclude that:

\[
P\left(\sup_{\substack{\nu_1,\dots,\nu_h\in \hat{B}_{d,\eta}\\ \Pi\in\hat{\mathcal{Q}}_\eta}} |F(\Pi,\nu_1,\dots,\nu_h)-\hat{F}(\Pi,\nu_1,\dots,\nu_h)| > \beta d h\right) \le 2dh\,|\hat{\mathcal{Q}}_\eta|\,|\hat{B}_{d,\eta}|^h \exp\left(-\frac{\beta^2\kappa^2}{2}\right) \quad (41)
\]

Combining equation 39 and equation 41, we conclude the second item in the statement of the lemma. The concentration of $X_1$ and $X_h$ follows in a similar fashion, but here we consider an $\eta$-net also over $x$ and use the Lipschitzness results given in Lemma 10 together with the fact that $x \mapsto f(\phi; x)$ is 1-Lipschitz.
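The $\eta$-net argument above (replace a supremum over a continuum by a supremum over a finite net, paying a Lipschitz-constant-times-$\eta$ penalty) can be illustrated numerically in $d=2$, where an explicit net of the sphere is easy to write down; the grid spacing and test function are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
eta = 0.05
# An eta-net of the unit circle: angle grid with spacing eta/2, so adjacent
# net points are within Euclidean distance eta/2 < eta of any circle point.
angles = np.arange(0, 2 * np.pi, eta / 2)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

a = rng.normal(size=2); a /= np.linalg.norm(a)
f = net @ a            # f(x) = <a, x> is 1-Lipschitz; its sup over the circle is 1

assert f.max() >= 1 - eta          # the net sup is within eta of the true sup
assert f.max() <= 1 + 1e-9         # and never exceeds it
```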

E PROOF OF THEOREM 3

Lemma 9. Suppose $h\in[H-1]$, and let $g:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ be such that $|g(s,a)|\le B$ for every $(s,a)$. For any policy $\pi_h$ and any $\nu$ such that $\|\nu\|_2\le 1$, we must have:

\[
P\left(\left|\hat{T}_h(g;\nu,\pi_h)-T_h(g;\nu,\pi_h)\right| > \beta \,\Big|\, (\phi_{ht})_{t\in[T]},\, G_{\phi,h}\succeq\kappa^2 I\right) \le 2\exp\left(-\frac{\beta^2\kappa^2}{2B^2}\right)
\]

Lemma 10. Let $\Pi=(\pi_1,\dots,\pi_H)$ and $\Pi'=(\pi_1',\dots,\pi_H')$ be policies in $\mathcal{Q}$. Conditioned on the event $G_{\phi,h}\succeq\kappa^2 I$, the following hold:

\[
|F(\Pi,\nu_1,\dots,\nu_{h-1},\nu_h)-F(\Pi',\nu_1',\dots,\nu_{h-1}',\nu_h')| \le \left(\sum_{j=2}^{h} C_\mu\,\mathrm{TV}(\pi_j,\pi_j')\,\|\nu_j\|_1 + 2\|\nu_j-\nu_j'\|_1\right) + \mathrm{TV}(\pi_1,\pi_1') + 2\|\nu_1-\nu_1'\|_1 \quad (42)
\]

\[
|\hat{F}(\Pi,\nu_1,\dots,\nu_{h-1},\nu_h)-\hat{F}(\Pi',\nu_1',\dots,\nu_{h-1}',\nu_h')| \le \sqrt{\frac{T}{\kappa^2}}\left(\sum_{j=2}^{h} \mathrm{TV}(\pi_j,\pi_j')\,\|\nu_j\|_2 + \|\nu_j-\nu_j'\|_2\right) + \mathrm{TV}(\pi_1,\pi_1') + \sum_{j=1}^{h}\|\nu_j-\nu_j'\|_1 \quad (43)
\]

Suppose $x, x'\in S^{d-1}$. Then:

\[
|T_h(f(\cdot;x),\nu,\pi_h)-T_h(f(\cdot;x'),\nu',\pi_h')| \le 2C_\mu(\sqrt{d}+\xi d)\left(\|\nu-\nu'\|_1+\mathrm{TV}(\pi_h,\pi_h')+\|x-x'\|_2\,\|\nu\|_1\right) \quad (44)
\]

\[
|\hat{T}_h(f(\cdot;x),\nu,\pi_h)-\hat{T}_h(f(\cdot;x'),\nu',\pi_h')| \le 2\sqrt{\frac{T}{\kappa^2}}(\sqrt{d}+\xi d)\left(\|\nu-\nu'\|_2+\mathrm{TV}(\pi_h,\pi_h')+\|x-x'\|_2\,\|\nu\|_2\right) \quad (45)
\]

Lemma 11. Condition on the event $G_{\phi,h}\succeq\kappa^2 I$ for every $h\in[H]$. Fix some $\eta>0$ and let $\hat{\mathcal{Q}}_\eta$ denote any $\eta$-net over $\mathcal{Q}$. With probability at least $1-\delta/4$, the following hold simultaneously:

1. $\sup_{\nu}\sup_{\Pi\in\mathcal{Q}} |E_{\nu,1}(\Pi)-\hat{E}_{\nu,1}(\Pi)| \le \frac{Cd}{\kappa}\sqrt{\log\frac{d|\hat{\mathcal{Q}}_\eta|}{\delta}} + C\eta$.

2. For $h>1$: $\sup_{\Pi\in\mathcal{Q}}\sup_{\nu\in B_d(1)} |E_{\nu,h}(\Pi)-\hat{E}_{\nu,h}(\Pi)| \le \frac{CdH}{\kappa}\sqrt{\log\frac{dH|\hat{\mathcal{Q}}_\eta|}{\delta}+Hd\log\frac{d}{\eta}} + C\sqrt{\frac{T}{\kappa^2}}\,\eta H$.

3. $X_1 := \sup_{\Pi=(\pi_1,\dots,\pi_H)\in\mathcal{Q}} \left|\inf_{x\in S^{d-1}}\hat{T}_1(f(\cdot;x),\pi_1)-\inf_{x\in S^{d-1}} T_1(f(\cdot;x),\pi_1)\right|$ satisfies $X_1 \le \frac{C(\sqrt{d}+\xi d)}{\kappa}\sqrt{\log\frac{|\hat{\mathcal{Q}}_\eta|}{\delta}+d\log\frac{d}{\eta}} + C\eta(\sqrt{d}+\xi d)$.

4. $X_h := \sup_{\substack{\nu\in B(1)\\ \Pi=(\pi_1,\dots,\pi_H)\in\mathcal{Q}}} \left|\inf_{x\in S^{d-1}}\hat{T}_h(f(\cdot;x);\nu,\pi_h)-\inf_{x\in S^{d-1}} T_h(f(\cdot;x);\nu,\pi_h)\right|$ satisfies $X_h \le \frac{C(\sqrt{d}+\xi d)}{\kappa}\sqrt{\log\frac{|\hat{\mathcal{Q}}_\eta|H}{\delta}+d\log\frac{d}{\eta}} + C\eta(\sqrt{d}+\xi d)\left(C_\mu+\sqrt{\frac{T}{\kappa^2}}\right)$.

Lemma 12. Let $\Pi=(\pi_1,\dots,\pi_H)$. For any $\eta\ge 0$ and $h\in[H]$, suppose $E_{\nu,h}(\Pi)\le\eta$. Then we have $\|\mathbb{E}\phi(S_h,A_h)-\nu\|_1\le\eta$.

Proof. Let $S_{1:H}, A_{1:H}\sim \mathcal{M}(\Pi)$. By Lemma 1, we conclude that $\mathbb{E}\phi(S_1,A_1)=T_1(\phi,\pi_1)$.
Therefore we conclude the lemma for the case $h=1$. Now let $h>1$, and note that for $j>1$ we have $\mathbb{E}\phi(S_j,A_j)=T_j(\phi,\mathbb{E}\phi(S_{j-1},A_{j-1}),\pi_j)$. Since $E_{\nu,h}(\Pi)\le\eta$, there exists a sequence $\nu_1,\dots,\nu_{h-1}$ (with $\nu_h:=\nu$) such that

\[
E_{\nu_1,1}(\Pi)+\sum_{j=2}^{h}\|T_j(\phi,\nu_{j-1},\pi_j)-\nu_j\|_1 \le \eta
\]

Letting $E_{\nu_1,1}(\Pi)=:\eta_1$ and $\|T_j(\phi,\nu_{j-1},\pi_j)-\nu_j\|_1=:\eta_j$, we have from the case $h=1$ that $\|\mathbb{E}\phi(S_1,A_1)-\nu_1\|_1\le\eta_1$. For $j>1$:

\[
\|\mathbb{E}\phi(S_j,A_j)-\nu_j\|_1 = \|T_j(\phi,\mathbb{E}\phi(S_{j-1},A_{j-1}),\pi_j)-\nu_j\|_1 \le \|T_j(\phi,\mathbb{E}\phi(S_{j-1},A_{j-1}),\pi_j)-T_j(\phi,\nu_{j-1},\pi_j)\|_1 + \|T_j(\phi,\nu_{j-1},\pi_j)-\nu_j\|_1 \le \|\nu_{j-1}-\mathbb{E}\phi(S_{j-1},A_{j-1})\|_1 + \eta_j
\]

We have used equation 31 in the last step. Continuing recursively and summing $\sum_j \eta_j\le\eta$, we conclude the result.

Proof of Theorem 3. We condition on the event described in Lemma 11 and suppose that $\kappa$, $\eta$ and $\eta_0$ are related as in the statement of the theorem. We apply these values whenever we invoke the concentration bounds obtained from Lemma 11 in the inequalities below. First consider $h=1$, and let $\hat{\Pi}_{f,1}=(\hat{\pi}_1^{f,1},\dots,\hat{\pi}_H^{f,1})$. By item 3 in Lemma 11 (with $X_1$ as defined there), we have:

\[
\inf_{x\in S^{d-1}} \hat{T}_1(f(\cdot;x),\hat{\pi}_1^{f,1}) \ge \sup_{\Pi\in\mathcal{Q}}\,\inf_{x\in S^{d-1}} T_1(f(\cdot;x),\pi_1) - X_1 \ge \zeta - X_1 \ge \frac{3\zeta}{4}
\]

Similarly, we have:

\[
\inf_{x\in S^{d-1}} T_1(f(\cdot;x),\hat{\pi}_1^{f,1}) \ge \inf_{x\in S^{d-1}} \hat{T}_1(f(\cdot;x),\hat{\pi}_1^{f,1}) - \frac{\zeta}{4}
\]

Combining the two displays above, we conclude the theorem for $h=1$. Now consider $h>1$. We first show that the constraint $\hat{E}_{\nu,h-1}(\Pi)\le\eta_0$ is feasible for some $\Pi\in\mathcal{Q}$ and some $\nu$. Note that for any policy $\Pi$ there exist $\nu_1,\dots,\nu_{h-1}\in\mathbb{R}^d$ such that $\mathbb{E}\phi(S_j,A_j)=\nu_j$ whenever $S_{1:H},A_{1:H}\sim\mathcal{M}(\Pi)$. For the choice $\nu=\nu_{h-1}$, we must have $E_{\nu,h-1}(\Pi)=0$. Now, by items 1 and 2 of Lemma 11, we conclude that $\hat{E}_{\nu,h-1}(\Pi)\le\eta_0$, so the optimization problem is feasible. Consider its solutions $\hat{\nu}$ and $\hat{\Pi}_{f,h}$. Note again from Lemma 11 that $E_{\hat{\nu},h-1}(\hat{\Pi}_{f,h}) \le \hat{E}_{\hat{\nu},h-1}(\hat{\Pi}_{f,h}) + \eta_0 \le 2\eta_0$.
Now, applying Lemma 12, we conclude that whenever $S_{1:H},A_{1:H}\sim\mathcal{M}(\hat{\Pi}_{f,h})$:

\[
\|\mathbb{E}\phi(S_{h-1},A_{h-1})-\hat{\nu}\|_1 \le 2\eta_0
\]

By a reasoning similar to the case $h=1$, we conclude:

\[
\inf_{x\in S^{d-1}} T_h(f(\cdot;x),\hat{\nu},\hat{\pi}_h^{f,h}) \ge \frac{3\zeta}{4}
\]

Now, applying equation 44, we conclude:

\[
\inf_{x\in S^{d-1}} \mathbb{E} f(S_h,A_h;x) = \inf_{x\in S^{d-1}} T_h(f(\cdot;x),\mathbb{E}\phi(S_{h-1},A_{h-1}),\hat{\pi}_h^{f,h}) \ge \inf_{x\in S^{d-1}} T_h(f(\cdot;x),\hat{\nu},\hat{\pi}_h^{f,h}) - \sup_{x\in S^{d-1}}\left|T_h(f(\cdot;x),\mathbb{E}\phi(S_{h-1},A_{h-1}),\hat{\pi}_h^{f,h}) - T_h(f(\cdot;x),\hat{\nu},\hat{\pi}_h^{f,h})\right|
\]
\[
\ge \frac{3\zeta}{4} - 2C_\mu(\sqrt{d}+\xi d)\,\|\hat{\nu}-\mathbb{E}\phi(S_{h-1},A_{h-1})\|_1 \ge \frac{\zeta}{2}
\]

In the last step, we have used the Lipschitzness bound for $T_h$ given in Lemma 10. We will show that the conditions given in equation 2 are satisfied for $\psi(S_h,A_h)$ with parameters $\zeta/2$ instead of $\zeta$.

F PROOF OF THEOREM 4

Let the unknown row set at iteration $t$ of the matrix estimation procedure of Section 6.1 be denoted by $\bar{I}_{t-1}$. For the analysis, we use the convention that $\bar{I}_t=\emptyset$ if the procedure terminates before the $t$-th iteration. Suppose $K_t$ is such that for every $t\le\log N$ we have:

\[
K_t\,|\bar{I}_{t-1}| \ge C\,\frac{r|\bar{I}_{t-1}|+dr}{\zeta^2\xi^2}\log\frac{d}{\zeta\xi} + \frac{C\log\left(\frac{\log N}{\delta}\right)}{\zeta^2\xi^2}
\]

We will then show that the event $\{|\bar{I}_t|\le\frac{1}{10}|\bar{I}_{t-1}|\ \forall t\le\log N\}\cap\{\hat{\Theta}_i=\Theta_i^*\ \forall i\in\bar{I}_{\log N}^{\complement}\}$ has probability at least $1-\delta$. To show this, it is sufficient to consider the step $t=1$ with $\bar{I}_0=[N]$, $K_1=K$, $\Psi^{(1)}=\Psi$ and show that, with probability $1-\frac{\delta}{\log N}$, $|\bar{I}_1|\le\frac{N}{10}$ and $\hat{\Theta}_i=\Theta_i^*$ for every $i\in\bar{I}_1^{\complement}$. The result then follows from a union bound. We will therefore establish the following structural lemma and prove Theorem 4; the rest of the section is then dedicated to proving Lemma 13.

Proof. $N(\Gamma_i,\Psi_i)$. Therefore, we have: $P\big(\sum_{k=1}^{K}\big|\langle\hat{\Theta}_i,\psi_{ik}\rangle-\theta^*_{ik}\big|$



Conditioned on $\mathcal{G}_h$, $\sigma(I_h)$ has the same distribution as $I_h$.

Proof. Let $\{\sigma_{(s,a)} : [N]\to[N] \mid (s,a)\in\mathcal{S}\times\mathcal{A}\}$ be a set of arbitrary permutations on $[N]$.

D.2 PROOF OF LEMMA 8

Proof. By Remark 4.3 in Wagenmaker et al. (2022), non-linear rewards can be handled by the reward-free RL algorithm in Phase 1 as long as all the rewards are uniformly bounded in $[0,1]$.

Suppose the distribution of $\psi_{ik}$ satisfies equation 2. Let $\mathcal{Y}(\Psi)$ denote the set of all matrices $\Delta$ with rank at most $2r$ such that $\mathcal{L}(\Delta,\Psi)=0$, and let $I_Z(\Delta)=\{i\in[N] : \Delta_i\neq 0\}$. With probability $1-\exp(-c\zeta^2\xi^2 NK)$, we must have:

\[
\mathcal{Y}(\Psi)\cap\left\{\Delta : |I_Z(\Delta)| > \frac{N}{10}\right\} = \emptyset
\]

Proof of Theorem 4. Let $\hat{\Theta}$ be the matrix of rank at most $r$ found satisfying $\mathcal{L}(\hat{\Theta}-\Theta^*,\Psi^{(t)})=0$. By Lemma 13, we have that $|I_Z(\hat{\Theta}-\Theta^*)|\le\frac{N}{10}$ with probability at least $1-\frac{\delta}{\log N}$ (setting $K=K_1$ as in the statement of Theorem 4). By Lemma 14, the probability that there exists an $i\in[N]$ such that $\hat{\Theta}_i\neq\Theta_i^*$ and $\sum_{k=1}^{K}|\langle\hat{\Theta}_i,\psi_{ik}\rangle-\theta^*_{ik}|=0$ is at most $N\exp(-c\zeta^2\xi^2 K)\le\delta N^{-c'}$ for some large constant $c'$. From this we conclude that $\hat{\Theta}_i=\Theta_i^*$ for every $i\in\bar{I}_1^{\complement}$.

Lemma 14. Fix any $\hat{\Theta}$, and suppose the distribution of $(\psi_{ik})_{i\in[N],k\in[K]}$ satisfies equation 2. Then there exists a small enough constant $c$ such that:

\[
P\left(\exists\, i \text{ s.t. } \hat{\Theta}_i\neq\Theta_i^* \text{ and } \sum_{k=1}^{K}\left|\langle\hat{\Theta}_i,\psi_{ik}\rangle-\theta^*_{ik}\right|=0\right) \le N\exp(-c\zeta^2\xi^2 K).
\]
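The round-count bookkeeping behind the union bound above (each round resolves at least $90\%$ of the remaining unknown rows, so roughly $\log_{10} N$ rounds suffice, and a per-round failure budget of $\delta/\log N$ union-bounds to $\delta$) can be sketched as follows; $N$ is an arbitrary illustration:

```python
import math

# Under the assumed per-round guarantee |I_t| <= |I_{t-1}| / 10, count how
# many rounds it takes until fewer than one unknown row remains.
N = 10_000
unknown, rounds = float(N), 0
while unknown >= 1:
    unknown /= 10                   # |I_t| <= |I_{t-1}| / 10
    rounds += 1

# The unknown set is exhausted after floor(log10 N) + 1 rounds.
assert rounds == math.floor(math.log10(N)) + 1
```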

$\cdots \frac{K\zeta^4\xi^2\|\Gamma_i\|^2}{32d}$

\[
P\Big(N(\Gamma_i,\Psi_i) < \frac{K\zeta^2\xi^2}{8}\Big) \le P\Big(\mathrm{Bin}(K,p_0)\le\frac{Kp_0}{2}\Big) \le \exp(-cp_0K)
\]

Suppose $\|u_i-\hat{u}_i\|\le\frac{\eta}{2}$ and $\|v_k-\hat{v}_k\|\le\frac{\eta}{2\sqrt{2r}}$ for every $i\in I$ and $k\in[2r]$. In order to show that $\|\hat{\Gamma}-\Gamma\|_{1,2,\intercal}\le\eta$, it is sufficient to show that $\|\hat{\Gamma}_i-\Gamma_i\|\le\eta$ for every $i\in I$.

\[
\|\hat{\Gamma}_i-\Gamma_i\| = \Big\|\sum_{k=1}^{2r}(\hat{u}_{ik}-u_{ik})\hat{v}_k + \sum_{k=1}^{2r} u_{ik}(\hat{v}_k-v_k)\Big\| \le \Big\|\sum_{k=1}^{2r}(\hat{u}_{ik}-u_{ik})\hat{v}_k\Big\| + \Big\|\sum_{k=1}^{2r} u_{ik}(\hat{v}_k-v_k)\Big\| \le \sqrt{\sum_{k=1}^{2r}(\hat{u}_{ik}-u_{ik})^2} + \sum_{k=1}^{2r}|u_{ik}|\,\|\hat{v}_k-v_k\| \le \eta \quad (58)
\]

Therefore $\hat{B}_{0,\eta}$ is an $\eta$-net with respect to $\|\cdot\|_{1,2,\intercal}$. By Corollary 4.2.13 in Vershynin (2018), we can pick the nets so that:

\[
|\hat{B}_{0,\eta}|\exp(-c\zeta^2\xi^2|I|K) \le \exp\Big(2dr\log\Big(\frac{4\sqrt{2r}}{\eta}+1\Big) + 2|I|r\log\Big(\frac{4}{\eta}+1\Big) - c\zeta^2\xi^2|I|K\Big) \quad (59)
\]

Therefore, taking $|I|\ge\frac{N}{10}$ and $\eta=\frac{c_1\zeta^4\xi^2}{d}$ for some constant $c_1$ small enough, and combining equation 59 with equation 55, we conclude that whenever $KN\ge\frac{Cr(N+d)}{\zeta^2\xi^2}\log\frac{d}{\zeta\xi}$, the failure probability is at most $\exp(-c\zeta^2\xi^2 NK)$. Now, consider $|I|\ge\frac{N}{10}$. The number of such sets $I$ is at most $\exp(c_1 N)$ for some constant $c_1>0$. Therefore, applying Lemma 17 along with a union bound over all $I$ such that $|I|\ge\frac{N}{10}$, we have:

Corollary 1. Under the conditions of Lemma 17, we have:

\[
\inf_{\substack{I\subseteq[N]\\ |I|\ge N/10}}\ \inf_{\Gamma\in B(N,d,I,2r)} \mathcal{L}(\Gamma,\Psi) > \frac{c_0\zeta^4\xi^2}{d}
\]

with probability at least $1-\exp(-c\zeta^2\xi^2 NK)$.

Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Now, consider $h>1$. Consider any $\eta$-net over $B_d(1)$, denoted by $\hat{B}_{d,\eta}$, with respect to the norm $\|\cdot\|_1$. We can take $|\hat{B}_{d,\eta}|\le\exp(Cd\log(d/\eta))$ (Vershynin, 2018). Invoking Lemma 10, we conclude:

$\|\psi(S_h,A_h)\|_2\le 1$ almost surely follows from the definition of $\psi$. Now, $\mathbb{E}f(S_h,A_h,x)\ge\frac{\zeta}{2}$ for every $x\in S^{d-1}$ implies $\mathbb{E}|\langle x,\psi(S_h,A_h)\rangle|\ge\frac{\zeta}{2\sqrt{d}}$. Using the definition of $f(S_h,A_h,x)$ (see Section 5) and the fact that $\mathbb{E}f(S_h,A_h,x)\ge\frac{\zeta}{2}$ as established above, we conclude that for every $x\in S^{d-1}$ we also have: $d\xi\,\mathbb{E}\langle x,\psi(S_h,A_h)\rangle^2 \le$

Consider the Paley–Zygmund inequality, which states that for any positive random variable $Z$ and any $\theta\in[0,1)$:

\[
P(Z\ge\theta\,\mathbb{E}Z) \ge (1-\theta)^2\,\frac{(\mathbb{E}Z)^2}{\mathbb{E}Z^2}
\]

Suppose $i\in I$ and denote $\Gamma_i := \hat{\Theta}_i-\Theta_i^*$. By the properties of $\psi_{ik}$, we have that $\mathbb{E}|\langle\Gamma_i,\psi_{ik}\rangle|\ge\frac{\zeta\|\Gamma_i\|}{\sqrt{d}}$ and $\mathbb{E}|\langle\Gamma_i,\psi_{ik}\rangle|^2\le\frac{\|\Gamma_i\|^2\xi^2}{d}$. Applying the Paley–Zygmund inequality to the random variable $|\langle\Gamma_i,\psi_{ik}\rangle|$, we conclude the result in equation 52: $P\big(|\langle\psi_{ik},\Gamma_i\rangle|\ge$
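The Paley–Zygmund inequality itself can be checked exactly on a small discrete distribution (the support, probabilities, and thresholds below are arbitrary illustrations, not taken from the paper):

```python
import numpy as np

# Exact check of Paley-Zygmund on a finite distribution:
# P(Z >= theta * E[Z]) >= (1 - theta)^2 * (E[Z])^2 / E[Z^2].
vals = np.array([0.1, 0.5, 1.0, 2.0])     # support of Z (illustrative)
probs = np.array([0.4, 0.3, 0.2, 0.1])    # probabilities (sum to 1)

EZ = vals @ probs
EZ2 = (vals ** 2) @ probs
for theta in (0.25, 0.5, 0.75):
    lhs = probs[vals >= theta * EZ].sum()
    assert lhs >= (1 - theta) ** 2 * EZ ** 2 / EZ2
```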


Here $\mathrm{Bin}(K,p_0)$ denotes a binomial random variable. In the second step we have used the fact that $N(\Gamma_i,\Psi_i)$ is a sum of $K$ independent Bernoulli random variables, each with success probability at least $p_0=\frac{\zeta^2\xi^2}{4}$. In the last step we have used Bernstein's inequality for the concentration of sums of Bernoulli random variables (see Boucheron et al. (2013)). The statement of the result then follows from a union bound argument over $i\in I$.

Proof. For every $\Delta\in M(N,d,I,2r)$, we construct $\Gamma$ such that: … Now, by hypothesis, $\mathcal{L}(\Gamma,\Psi)>0$. This implies there exist $i\in I$ and $k\in[K]$ such that $|\langle\psi_{ik},\Gamma_i\rangle|>0$. This implies $|\langle\psi_{ik},\Delta_i\rangle|>0$, and hence we conclude that $\mathcal{L}(\Delta,\Psi)>0$.

Lemma 16. Suppose $\Gamma$ is such that $\|\Gamma_i\|=1$ for every $i\in I$, and suppose the distribution of $(\psi_{ik})_{i\in[N],k\in[K]}$ satisfies equation 2. Then there exists a small enough constant $c$ such that: …

Proof. Consider the Paley–Zygmund inequality, which states that for any positive random variable $Z$ and any $\theta\in[0,1)$, $P(Z\ge\theta\,\mathbb{E}Z)\ge(1-\theta)^2\frac{(\mathbb{E}Z)^2}{\mathbb{E}Z^2}$. Suppose $i\in I$. By the properties of $\psi_{ik}$, we have that … Applying the Paley–Zygmund inequality to the random variable $|\langle\Gamma_i,\psi_{ik}\rangle|$, we conclude the result in equation 52: … $\frac{\cdots}{4dNK}\,N(\Gamma,\Psi)$ almost surely. Therefore, we have: …

Here $\mathrm{Bin}(|I|K,p_0)$ denotes a binomial random variable. In the second step we have used the fact that $N(\Gamma,\Psi)$ is a sum of $|I|K$ independent Bernoulli random variables, each with success probability at least $p_0=\frac{\zeta^2\xi^2}{4}$. In the last step we have used Bernstein's inequality for the concentration of sums of Bernoulli random variables (see Boucheron et al. (2013)).

Lemma 17. Suppose the distribution of $(\psi_{ik})_{i\in[N],k\in[K]}$ satisfies equation 2, and let $|I|\ge\frac{N}{10}$. There exist positive constants $c_0, c, C$ such that whenever $KN\ge\frac{Cr(N+d)}{\zeta^2\xi^2}\log\frac{d}{\zeta\xi}$, we have: …

Proof.
It is sufficient to prove this result for $\Gamma\in B_0(N,d,I,2r)\subseteq B(N,d,I,2r)$, the set of all matrices such that $\|\Gamma_i\|=1$ for every $i\in I$ and $\Gamma_i=0$ otherwise. Define $\|\Gamma\|_{1,2,\intercal} :=$ … In the third step, we have used the fact that $\|\psi_{ik}\|\le 1$ and the Cauchy–Schwarz inequality to imply … Therefore, given any $\eta$-net of $B_0(N,d,I,2r)$, denoted $\hat{B}_{0,\eta}$, we must have: … We will now parametrize $B_0(N,d,I,2r)$ as follows:

Claim 8. Every $\Gamma\in B_0(N,d,I,2r)$ can be written as

\[
\Gamma_i = \sum_{k=1}^{2r} u_{ik}\, v_k,
\]

where $v_1,\dots,v_{2r}$ are orthonormal vectors in $\mathbb{R}^d$ and $u_i=(u_{ik})_{k=1}^{2r}\in\mathbb{R}^{2r}$ are such that $\|u_i\|=1$.

Proof. By the singular value decomposition, we have $\Gamma=W\Sigma V^{\intercal}$ for orthogonal matrices $W, V$ and the singular value matrix $\Sigma$. Therefore, $\Gamma_{ij}=\sum_{k=1}^{2r} w_{ik}\sigma_k v_{kj}$. Denoting $u_{ik}:=w_{ik}\sigma_k$, we note that $\Gamma_i=\sum_{k=1}^{2r} u_{ik}v_k$, where $v_k$ is the $k$-th column of $V$. Now, it remains to show that $\|u_i\|=1$. By the orthonormality of $v_1,\dots,v_{2r}$ and the definition of $\Gamma$, we have:

\[
1 = \|\Gamma_i\|^2 = \Big\|\sum_{k=1}^{2r} u_{ik}v_k\Big\|^2 = \sum_{k=1}^{2r} u_{ik}^2 = \|u_i\|^2.
\]

Therefore, we construct an $\eta$-net for $B_0(N,d,I,2r)$ as follows: consider any $\eta/2$-net over the sphere $S^{2r-1}$, denoted $\hat{S}_{\eta/2}(2r)$, with respect to the Euclidean norm. Similarly, consider any
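Claim 8 can be sanity-checked numerically with an SVD; the matrix sizes below are illustrative assumptions, and `r2` plays the role of $2r$:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, r2 = 8, 6, 4                      # illustrative sizes; r2 stands for 2r

# Build a rank-r2 matrix with unit-norm rows, as in B_0(N, d, I, 2r) with I = [N].
A = rng.normal(size=(N, r2)) @ rng.normal(size=(r2, d))
Gamma = A / np.linalg.norm(A, axis=1, keepdims=True)

# SVD parametrization from Claim 8: Gamma_i = sum_k u_{ik} v_k with
# orthonormal v_k and ||u_i|| = ||Gamma_i|| = 1.
W, S, Vt = np.linalg.svd(Gamma, full_matrices=False)
U = W[:, :r2] * S[:r2]                  # u_{ik} = w_{ik} * sigma_k
V = Vt[:r2]                             # rows v_k are orthonormal

assert np.allclose(U @ V, Gamma)                      # Gamma_i = sum_k u_{ik} v_k
assert np.allclose(np.linalg.norm(U, axis=1), 1.0)    # ||u_i|| = 1
assert np.allclose(V @ V.T, np.eye(r2))               # orthonormal v_k
```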

