REPRESENT TO CONTROL PARTIALLY OBSERVED SYSTEMS: REPRESENTATION LEARNING WITH PROVABLE SAMPLE EFFICIENCY

Abstract

Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Represent to Control (RTC), which learns the representation at two levels while optimizing the policy. (i) For each step, RTC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, RTC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, RTC attains an O(1/ϵ²) sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here ϵ is the optimality gap. To the best of our knowledge, RTC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.

1. INTRODUCTION

Deep reinforcement learning demonstrates significant empirical successes in Markov decision processes (MDPs) with large state spaces (Mnih et al., 2013; 2015; Silver et al., 2016; 2017). Such empirical successes are attributed to the integration of representation learning into reinforcement learning. In other words, mapping the state to a low-dimensional feature enables model/value learning and optimal control in a sample-efficient manner. Meanwhile, it is becoming better understood in theory that the low-dimensional feature is the key to sample efficiency in the linear setting (Cai et al., 2020; Jin et al., 2020b; Ayoub et al., 2020; Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021). In contrast, partially observed Markov decision processes (POMDPs) with large observation and state spaces remain significantly more challenging. Due to the lack of the Markov property, a low-dimensional feature of the observation at each step is insufficient for the prediction and control of the future (Sondik, 1971; Papadimitriou and Tsitsiklis, 1987; Coates et al., 2008; Azizzadenesheli et al., 2016; Guo et al., 2016). Instead, it is necessary to obtain a low-dimensional embedding of the history, which assembles the low-dimensional features across multiple steps (Hefny et al., 2015; Sun et al., 2016). In practice, learning such features and embeddings requires various heuristics, e.g., recurrent neural network architectures and auxiliary tasks (Hausknecht and Stone, 2015; Li et al., 2015; Mirowski et al., 2016; Girin et al., 2020). In theory, the best results are restricted to the tabular setting (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Liu et al., 2022), which does not involve representation learning. To this end, we identify a class of POMDPs with a low-rank structure on the state transition kernel (but not on the observation emission kernel), which allows prediction and control in a sample-efficient manner.
More specifically, the transition admits a low-rank factorization into two unknown features, whose dimension is the rank. On top of the low-rank transition, we define a Bellman operator, which performs a forward update for any finite-length trajectory. The Bellman operator allows us to further factorize the history across multiple steps to obtain its embedding, which assembles the per-step features. By integrating the two levels of representation learning, that is, (i) feature learning at each step and (ii) embedding learning across multiple steps, we propose a sample-efficient algorithm, namely Represent to Control (RTC), for POMDPs with infinite observation and state spaces. The key to RTC is balancing exploitation and exploration along the representation learning process. To this end, we construct a confidence set of embeddings upon identifying and estimating the Bellman operator, which further allows efficient exploration via optimistic planning. It is worth mentioning that such a unified framework allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). We analyze the sample efficiency of RTC under the future and past sufficiency assumptions. In particular, such assumptions ensure that the future and past observations are sufficient for identifying the belief state, which captures the information-theoretic difficulty of POMDPs. We prove that RTC attains an O(1/ϵ²) sample complexity that scales polynomially with the horizon and the dimension of the feature (that is, the rank of the transition). Here ϵ is the optimality gap. The polynomial dependency on the horizon is attributed to embedding learning across multiple steps, while the polynomial dependency on the dimension is attributed to feature learning at each step, which is the key to bypassing the infinite sizes of the observation and state spaces.

Contributions. In summary, our contribution is threefold.
• We identify a class of POMDPs with the low-rank transition, which allows representation learning and reinforcement learning in a sample-efficient manner.
• We propose RTC, a principled approach integrating embedding learning and control in the low-rank POMDP.
• We establish the sample efficiency of RTC in the low-rank POMDP with infinite observation and state spaces.

Related Work. Our work follows the previous studies of POMDPs. In general, solving a POMDP is intractable from both the computational and the statistical perspectives (Papadimitriou and Tsitsiklis, 1987; Vlassis et al., 2012; Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a). Given such computational and statistical barriers, previous works attempt to identify tractable subclasses of POMDPs. In particular, Azizzadenesheli et al. (2016); Guo et al. (2016); Jin et al. (2020a); Liu et al. (2022) consider tabular POMDPs with (left) invertible emission matrices. Efroni et al. (2022) consider POMDPs where the state is fully determined by the most recent observations of a fixed length. Cayci et al. (2022) analyze POMDPs where a finite internal state approximately determines the state. In contrast, we analyze POMDPs with the low-rank transition and allow the state and observation spaces to be arbitrarily large. Meanwhile, our analysis hinges on the future and past sufficiency assumptions, which only require that the density of the state is identified by that of the future and past observations, respectively. In recent work, Cai et al. (2022) also utilize the low-rank property of the transition. Nevertheless, Cai et al. (2022) assume that the feature representation of state-action pairs is known, thus relieving the agent from feature learning. In contrast, we aim to recover an efficient state-action representation for planning. In terms of the necessity of exploration, Azizzadenesheli et al. (2016); Guo et al. (2016) analyze POMDPs where an arbitrary policy can conduct efficient exploration.
Similarly, Cayci et al. (2022) consider POMDPs with a finite concentrability coefficient (Munos, 2003; Chen and Jiang, 2019), where the visitation density of an arbitrary policy is close to that of the optimal policy. Meanwhile, Misra et al. (2020); Zhang et al. (2022) recover the latent state from the current observation. In contrast, we propose to recover the latent state based on the interaction history. In addition, our work conducts such latent recovery under the more challenging POMDP setup. See also §B for an additional literature review on related studies of latent state space models and MDPs.

Notation. We denote by R^d_+ the space of d-dimensional vectors with nonnegative entries. We denote by L^p(X) the L^p space of functions defined on X. We denote by ∆(d) the space of d-dimensional probability arrays, namely, the d-dimensional nonnegative arrays that sum up to one. We denote by [H] = {1, . . . , H} the index set of size H. For a linear operator M mapping from an L^p space to an L^q space, we denote by ∥M∥_{p→q} the operator norm of M. For a vector x ∈ R^d, we denote by [x]_i the i-th entry of x.

2. PARTIALLY OBSERVABLE MARKOV DECISION PROCESS

We define a partially observable Markov decision process (POMDP) by the tuple M = (S, A, O, {P_h}_{h∈[H]}, {O_h}_{h∈[H]}, r, H, µ_1), where H is the length of an episode, µ_1 is the initial distribution of the state s_1, and S, A, O are the state, action, and observation spaces, respectively. Here P_h(· | ·, ·) is the transition kernel, O_h(· | ·) is the emission kernel, and r(·) is the reward function. In each episode, the agent with policy π = {π_h}_{h∈[H]} interacts with the environment as follows. The environment selects an initial state s_1 drawn from the distribution µ_1. In the h-th step, the agent receives the reward r(o_h) and the observation o_h drawn from the observation density O_h(· | s_h), and makes the decision a_h = π_h(τ_1^h) according to the policy π_h, where τ_1^h = {o_1, a_1, . . . , a_{h−1}, o_h} is the interaction history. The environment then transits into the next state s_{h+1} drawn from the transition distribution P_h(· | s_h, a_h). The procedure terminates once the environment transits into the termination state s_{H+1}. In the sequel, we assume that the action space A is finite with capacity |A| = A. Meanwhile, we highlight that the observation and state spaces O and S are possibly infinite.

Value Functions and Learning Objective. For a given policy π = {π_h}_{h∈[H]}, we define the following value function, which captures the expected cumulative reward from interactions,

V^π = E_π[∑_{h=1}^H r(o_h)].    (2.1)

Here we denote by E_π the expectation taken with respect to the policy π, the transition dynamics, and the emission. Our goal is to derive a policy that maximizes the cumulative reward. In particular, we aim to derive an ϵ-suboptimal policy π such that V^{π*} − V^π ≤ ϵ based on minimal interactions with the environment, where π* = argmax_π V^π is the optimal policy.

Notations of POMDP. In the sequel, we introduce notations of the POMDP to simplify the discussion. We define

a_h^{h+k−1} = {a_h, a_{h+1}, . . . , a_{h+k−1}},  o_h^{h+k} = {o_h, o_{h+1}, . . . , o_{h+k}}

as the sequences of actions and observations, respectively. Correspondingly, we write r(o_1^H) = ∑_{h=1}^H r(o_h) for the cumulative reward of the observation sequence o_1^H. Meanwhile, we denote by τ_h^{h+k} the sequence of interactions from the h-th step to the (h+k)-th step, namely, τ_h^{h+k} = {o_h, a_h, . . . , o_{h+k−1}, a_{h+k−1}, o_{h+k}} = {a_h^{h+k−1}, o_h^{h+k}}. Similarly, we denote by τ̄_h^{h+k} the sequence of interactions from the h-th step to the (h+k)-th step that includes the latest action a_{h+k}, namely, τ̄_h^{h+k} = {o_h, a_h, . . . , o_{h+k}, a_{h+k}} = {a_h^{h+k}, o_h^{h+k}}. In addition, with a slight abuse of notation, we define

P^π(τ_h^{h+k}) = P^π(o_h, . . . , o_{h+k} | a_h, . . . , a_{h+k−1}) = P^π(o_h^{h+k} | a_h^{h+k−1}),
P^π(τ_h^{h+k} | s_h) = P^π(o_h, . . . , o_{h+k} | s_h, a_h, . . . , a_{h+k−1}) = P^π(o_h^{h+k} | s_h, a_h^{h+k−1}).

Extended POMDP. To simplify the discussion and notation, we introduce an extension of the POMDP, which allows us to access steps h smaller than one and larger than the length H of an episode. In particular, the interaction of an agent with the extended POMDP starts with a dummy initial state s_{1−ℓ} for some ℓ > 0. During the interactions, all the dummy action and observation sequences τ̄_{1−ℓ}^0 = {o_{1−ℓ}, a_{1−ℓ}, . . . , o_0, a_0} lead to the same initial state distribution µ_1 that defines the POMDP. Moreover, the agent is allowed to interact with the environment for k steps after observing the final observation o_H of an episode. Nevertheless, the agent only collects the reward r(o_h) at steps h ∈ [H], which leads to the same learning objective as in the POMDP. In addition, we denote by [H]_+ = {1−ℓ, . . . , H+k} the set of steps in the extended POMDP. In the sequel, we do not distinguish between a POMDP and an extended POMDP for simplicity of presentation.
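The interaction protocol above can be sketched in code. The following is a minimal, hypothetical illustration: the `ToyPOMDP` environment, its `reset`/`observe`/`step` interface, and the example policy are invented for this sketch and are not part of the paper's setup.

```python
import random

class ToyPOMDP:
    """A hypothetical 2-state, 2-observation environment (not from the paper)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.s = 0
    def reset(self):
        self.s = self.rng.choice([0, 1])              # s_1 ~ mu_1 (uniform)
    def observe(self):
        # o_h ~ O_h(. | s_h): report the state correctly with probability 0.8
        o = self.s if self.rng.random() < 0.8 else 1 - self.s
        return o, float(o)                            # reward r(o_h) = o_h
    def step(self, a):
        # s_{h+1} ~ P_h(. | s_h, a_h): action 1 flips the state, action 0 keeps it
        self.s = self.s if a == 0 else 1 - self.s

def run_episode(env, policy, H):
    """Roll out one episode; the policy acts on the full history tau_1^h."""
    env.reset()
    history, total = [], 0.0
    for _ in range(H):
        o, r = env.observe()
        total += r
        history.append(o)
        a = policy(history)                           # a_h = pi_h(tau_1^h)
        history.append(a)
        env.step(a)
    return total

# A history-dependent policy: flip the state whenever the last observation is 0.
ret = run_episode(ToyPOMDP(), lambda hist: 0 if hist[-1] == 1 else 1, H=5)
```

Note that the policy consumes the entire history τ_1^h rather than the latest observation alone, which is exactly the point of the partially observed setting.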

3. A SUFFICIENT EMBEDDING FOR PREDICTION AND CONTROL

The key to solving a POMDP is inference, which recovers the density of future observations (or linear functionals of that density, e.g., the value functions) given the interaction history. To this end, previous approaches (Shani et al., 2013) typically maintain a belief, namely, a conditional density P(s_h = · | τ_1^h) of the current state given the interaction history. The typical inference procedure first conducts filtering, namely, calculating the belief at the (h+1)-th step given the belief at the h-th step. Upon collecting the belief, the density of future observations is obtained via prediction, which acquires the distribution of future observations based on the distribution of the state s_{h+1}. In the case that maintaining a belief or conducting the prediction is intractable, previous approaches establish a predictive state (Hefny et al., 2015; Sun et al., 2016), which is an embedding that is sufficient for inferring the density of future observations given the interaction history. Such approaches typically recover the filtering of predictive representations by solving moment equations. In particular, Hefny et al. (2015); Sun et al. (2016) establish such moment equations based on structural assumptions on the filtering of such predictive states. Similarly, Anandkumar et al. (2012); Jin et al. (2020a) establish a sequence of observation operators and recover the trajectory density via such operators. Motivated by the previous work, we aim to construct an embedding that is both learnable and sufficient for control. A sufficient embedding for control is the density of the trajectory, namely,

Φ(τ_1^H) = P(τ_1^H).    (3.1)

Such an embedding is sufficient as it allows us to estimate the cumulative reward V^π of an arbitrary given policy π. In the sequel, we aim to estimate such an embedding and further conduct planning based on the estimated embedding.
Nevertheless, estimating such an embedding is challenging when the length H of an episode and the observation space O are large. To this end, we exploit the low-rank structure in the state transition of POMDPs.
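The classical filtering step described at the opening of this section can be sketched in the tabular case. The transition and emission tables below are hypothetical toy data, not objects from the paper.

```python
import numpy as np

def filter_belief(b, a, o, P, O):
    """One filtering step: b_{h+1}(s') is proportional to
    O(o_{h+1} | s') * sum_s P(s' | s, a) b(s).
    P[a] is an (s' x s) transition matrix and O[o] a row of emission
    likelihoods; both are hypothetical tabular stand-ins for the kernels."""
    pred = P[a] @ b               # prediction: push the belief through P_h(. | ., a)
    post = O[o] * pred            # correction: weight by the emission likelihood
    return post / post.sum()      # normalize back to a probability vector

# Toy example: 2 states, 2 observations, 1 action.
P = np.array([[[0.9, 0.2],
               [0.1, 0.8]]])      # P[a][s', s]
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])        # O[o][s]
b0 = np.array([0.5, 0.5])
b1 = filter_belief(b0, a=0, o=1, P=P, O=O)
```

When the state space is continuous, the sums above become integrals, which is precisely the intractability that motivates predictive-state approaches and the embedding (3.1).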

3.1. LOW-RANK POMDP

Assumption 3.1 (Low-Rank POMDP). We assume that the transition kernel P_h takes the following low-rank form for all h ∈ [H]_+,

P_h(s_{h+1} | s_h, a_h) = ψ*_h(s_{h+1})^⊤ ϕ*_h(s_h, a_h),

where ψ*_h : S → R^d_+ and ϕ*_h : S × A → ∆(d) are unknown features. Here recall that we denote by [H]_+ = {1−ℓ, . . . , H+k} the set of steps in the extended POMDP. Note that our low-rank POMDP assumption does not specify the form of the emission kernels. In contrast, we only require the transition kernels of the states to be linear in unknown features.

Function Approximation. We highlight that the features in Assumption 3.1 are unknown to us. Correspondingly, we assume that we have access to a parameter space Θ that allows us to fit such features as follows.

Definition 3.2 (Function Approximation). We define the following function approximation space F_Θ = {F_Θ^h}_{h∈[H]_+} corresponding to the parameter space Θ,

F_Θ^h = {(ψ_h^θ, ϕ_h^θ, O_h^θ) : θ ∈ Θ}, ∀h ∈ [H]_+.

Here O_h^θ : S × O → R_+ is an emission kernel and ψ_h^θ : S → R^d_+, ϕ_h^θ : S × A → ∆(d) are features for all h ∈ [H]_+ and θ ∈ Θ. In addition, it holds that ψ_h^θ(·)^⊤ ϕ_h^θ(s_h, a_h) defines a probability density over s_{h+1} ∈ S for all h ∈ [H]_+ and (s_h, a_h) ∈ S × A.

Here (ψ_h^θ, ϕ_h^θ, O_h^θ) denotes a parameterization of the features and emission kernels. In practice, one typically utilizes linear or neural network parameterizations for the features and emission kernels. In the sequel, we write P^θ and P^{θ,π} for the probability densities corresponding to the transition dynamics defined by {ψ_h^θ, ϕ_h^θ, O_h^θ}_{h∈[H]_+} and the policy π, respectively. We impose the following realizability assumption to ensure that the true model belongs to the parameterized function space F_Θ.

Assumption 3.3 (Realizable Parameterization). We assume that there exists a parameter θ* ∈ Θ such that ψ_h^{θ*} = ψ*_h, ϕ_h^{θ*} = ϕ*_h, and O_h^{θ*} = O_h for all h ∈ [H].
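As a sanity check of Assumption 3.1, a tabular transition kernel of the assumed form can be constructed explicitly; the sizes S, A, and d below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 2, 3      # state/action space sizes and rank (illustrative only)

# phi*(s, a) lies in the d-simplex and each column of psi* is a density,
# so that psi*(.)^T phi*(s, a) is a distribution over next states.
phi = rng.random((S, A, d))
phi /= phi.sum(axis=-1, keepdims=True)     # each phi[s, a] in Delta(d)
psi = rng.random((S, d))
psi /= psi.sum(axis=0, keepdims=True)      # each column a density over s'

# P[s_next, s, a] = psi(s_next)^T phi(s, a): a rank-d transition kernel.
P = np.einsum('nd,sad->nsa', psi, phi)
```

Every column P[·, s, a] sums to one by construction, and flattening P into an S × (S·A) matrix yields rank at most d, which is the low-rank structure the algorithm exploits.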
We define the following forward emission operator as a generalization of the emission kernel.

Definition 3.4 (Forward Emission Operator). We define the forward emission operator U_h^θ : L^1(S) → L^1(A^k × O^{k+1}) for all h ∈ [H] as

(U_h^θ f)(τ_h^{h+k}) = ∫_S P^θ(τ_h^{h+k} | s_h) · f(s_h) ds_h, ∀f ∈ L^1(S), ∀τ_h^{h+k} ∈ A^k × O^{k+1}.    (3.2)

Here recall that we denote by τ_h^{h+k} = {a_h^{h+k−1}, o_h^{h+k}} ∈ A^k × O^{k+1} the trajectory of interactions. In addition, recall that we define P^θ(τ_h^{h+k} | s_h) = P^θ(o_h^{h+k} | s_h, a_h^{h+k−1}) for notational simplicity. We omit the dependency of U_h^θ on the length k of the trajectory to simplify the notation. We remark that, when applied to a belief, i.e., a density over the state s_h, the forward emission operator returns the density of the trajectory τ_h^{h+k} of k steps ahead of the h-th step.

Bottleneck Factor Interpretation of the Low-Rank Transition. Recall that in Assumption 3.1, the feature ϕ*_h maps the state-action pair (s_h, a_h) ∈ S × A to the d-dimensional simplex ∆(d). Equivalently, one can view the low-rank transition as a latent variable model, where the next state s_{h+1} is generated by first drawing a bottleneck factor q_h ∼ ϕ*_h(s_h, a_h) and then drawing the next state s_{h+1} ∼ [ψ*_h(·)]_{q_h}. In other words, the probability array ϕ*_h(s_h, a_h) ∈ ∆(d) induces a transition from the state-action pair (s_h, a_h) to the bottleneck factor q_h ∈ [d] as follows,

P_h(q_h | s_h, a_h) = [ϕ*_h(s_h, a_h)]_{q_h}, ∀q_h ∈ [d].

Correspondingly, we write P_h(s_{h+1} | q_h) = [ψ*_h(s_{h+1})]_{q_h} for the transition probability from the bottleneck factor q_h ∈ [d] to the state s_{h+1} ∈ S. See Figure 1 for an illustration of the data generating process with the bottleneck factors.

Understanding the Bottleneck Factor. Utilizing the low-rank structure of the state transition requires us to understand the bottleneck factors {q_h}_{h∈[H]} defined by the low-rank transition.
We highlight that the bottleneck factor q_h is a compressed yet sufficient factor for inference. In particular, the bottleneck factor q_h determines the distribution of the next state s_{h+1} through the feature ψ*_h, namely, ψ*_h(s_{h+1} = ·) = P(s_{h+1} = · | q_h = ·). Such a property motivates us to obtain our desired embedding by decomposing the density of the trajectory based on the feature set {ψ*_h}_{h∈[H]_+}. To achieve such a decomposition, we first introduce the following sufficiency condition for all the parameterized features ψ_h^θ with θ ∈ Θ.

Assumption 3.5 (Future Sufficiency). We define the mapping g_h^θ : A^k × O^{k+1} → R^d for all θ ∈ Θ and h ∈ [H] as

g_h^θ = (U_h^θ [ψ_{h−1}^θ]_1, . . . , U_h^θ [ψ_{h−1}^θ]_d)^⊤,

where [ψ_{h−1}^θ]_i denotes the i-th entry of the mapping ψ_{h−1}^θ for all i ∈ [d]. We assume for some k > 0 that the matrix

M_h^θ = ∫_{A^k×O^{k+1}} g_h^θ(τ_h^{h+k}) g_h^θ(τ_h^{h+k})^⊤ dτ_h^{h+k} ∈ R^{d×d}

is invertible. We denote by M_h^{θ,†} the inverse of M_h^θ for all θ ∈ Θ and h ∈ [H].

Intuitively, the future sufficiency condition in Assumption 3.5 guarantees that the density of the future trajectory τ_h^{h+k} captures the information of the bottleneck factor q_{h−1}, which further captures the belief at the h-th step. To see this, we have the following lemma.

Lemma 3.6 (Pseudo-Inverse of Forward Emission).
We define the linear operator U_h^{θ,†} : L^1(A^k × O^{k+1}) → L^1(S) for all θ ∈ Θ and h ∈ [H] as

(U_h^{θ,†} f)(s_h) = ∫_{A^k×O^{k+1}} ψ_{h−1}^θ(s_h)^⊤ M_h^{θ,†} g_h^θ(τ_h^{h+k}) · f(τ_h^{h+k}) dτ_h^{h+k},

where f ∈ L^1(A^k × O^{k+1}) is the input of the linear operator U_h^{θ,†} and g_h^θ is the mapping defined in Assumption 3.5. Under Assumptions 3.1 and 3.5, it holds for all h ∈ [H], θ ∈ Θ, and π ∈ Π that

U_h^{θ,†} U_h^θ (P_h^{θ,π}) = P_h^{θ,π}.

Here P_h^{θ,π} ∈ L^1(S) maps each state s_h ∈ S to the probability P_h^{θ,π}(s_h) of visiting the state s_h in the h-th step when following the policy π under the model defined by the parameter θ.

Proof. See §D.1 for a detailed proof.

By Lemma 3.6, the forward emission operator U_h^θ defined in Definition 3.4 has a pseudo-inverse U_h^{θ,†} under the future sufficiency condition in Assumption 3.5. Thus, one can identify the belief state by inverting the conditional density of the trajectory τ_h^{h+k} given the interaction history τ_1^h. More importantly, such invertibility further allows us to decompose the desired embedding Φ(τ_1^H) in (3.1) across steps, which we introduce in the sequel.
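The round-trip identity of Lemma 3.6 can be verified numerically in a finite setting, where the forward emission operator and the features become matrices. All sizes and entries below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_futures, d = 5, 8, 3    # finite stand-ins; futures index tau_h^{h+k}

Psi = rng.random((n_states, d))
Psi /= Psi.sum(axis=0, keepdims=True)    # columns: the densities [psi_{h-1}]_i
U = rng.random((n_futures, n_states))
U /= U.sum(axis=0, keepdims=True)        # forward emission: U[tau, s] = P(tau | s)

G = U @ Psi                               # G[tau, i] = (U [psi]_i)(tau), as in g_h
M = G.T @ G                               # the matrix of Assumption 3.5
U_dagger = Psi @ np.linalg.inv(M) @ G.T   # the pseudo-inverse of Lemma 3.6

# A state density in the span of the psi-features (as visitation densities
# are, under the low-rank transition) survives the round trip.
w = rng.random(d); w /= w.sum()
p = Psi @ w
p_rec = U_dagger @ (U @ p)
```

The check works because U† U p = Ψ M⁻¹ (UΨ)^⊤ U Ψ w = Ψ M⁻¹ M w = Ψ w = p, mirroring the proof sketch of the lemma.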

3.2. MULTI-STEP EMBEDDING DECOMPOSITION VIA BELLMAN OPERATOR

To accomplish the multi-step decomposition of the embedding, we first define the Bellman operator as follows.

Definition 3.7 (Bellman Operator). We define the Bellman operators

B_h^θ(a_h, o_h) : L^1(A^k × O^{k+1}) → L^1(A^k × O^{k+1}) for all (a_h, o_h) ∈ A × O and h ∈ [H] as

(B_h^θ(a_h, o_h) f)(τ_{h+1}^{h+k+1}) = ∫_S P^θ(τ_h^{h+k+1} | s_h) · (U_h^{θ,†} f)(s_h) ds_h, ∀τ_{h+1}^{h+k+1} ∈ A^k × O^{k+1}.

Here recall that we denote by τ_h^{h+k+1} = {o_h^{h+k+1}, a_h^{h+k}} and P^θ(τ_h^{h+k+1} | s_h) = P^θ(o_h^{h+k+1} | s_h, a_h^{h+k}) for notational simplicity. We call B_h^θ(a_h, o_h) in Definition 3.7 a Bellman operator as it performs a temporal transition from the density of the trajectory τ_h^{h+k} to the density of the trajectory τ_{h+1}^{h+k+1} and the observation o_h, given that one takes the action a_h at the h-th step. More specifically, Assumption 3.5 guarantees that the density of the trajectory τ_h^{h+k} identifies the density of s_h at the h-th step. The Bellman operator then performs the transition from the density of s_h to the density of the trajectory τ_{h+1}^{h+k+1} and the observation o_h given the action a_h. The following lemma shows that our desired embedding Φ(τ_1^H) can be decomposed into a product of the Bellman operators defined in Definition 3.7.

Lemma 3.8 (Embedding Decomposition). Under Assumptions 3.1 and 3.5, it holds for all θ ∈ Θ that

P^θ(τ_1^H) = (1/A^k) · ∫_{A^k×O^{k+1}} [B_H^θ(o_H, a_H) · · · B_1^θ(o_1, a_1) b_1^θ](τ_{H+1}^{H+k+1}) dτ_{H+1}^{H+k+1}.

Here recall that we denote by τ_{H+1}^{H+k+1} = {a_{H+1}^{H+k}, o_{H+1}^{H+k+1}} the dummy future trajectory. Meanwhile, we define the following initial trajectory density,

b_1^θ(τ_1^k) = (U_1^θ µ_1)(τ_1^k) = P^θ(τ_1^k), ∀τ_1^k ∈ A^k × O^{k+1}.

Proof. See §D.3 for a detailed proof.

By Lemma 3.8, we can obtain the desired embedding Φ(τ_1^H) = P(τ_1^H) from the product of the Bellman operators. It remains to estimate the Bellman operators at each step. In the sequel, we introduce an identity that allows us to recover the Bellman operators from observations.

Estimating the Bellman Operator. We introduce the following shorthand to simplify the discussion,

z_h = τ_h^{h+k} = {o_h, a_h, . . . , a_{h+k−1}, o_{h+k}} ∈ A^k × O^{k+1},
w_{h−1} = τ̄_{h−ℓ}^{h−1} = {o_{h−ℓ}, a_{h−ℓ}, . . . , o_{h−1}, a_{h−1}} ∈ A^ℓ × O^ℓ.    (3.5)

We first define two density mappings that induce the identity of the Bellman operator. We define the density mapping X_h^{θ,π} : A^ℓ × O^ℓ → L^1(A^k × O^{k+1}) as

X_h^{θ,π}(w_{h−1}) = P^{θ,π}(w_{h−1}, z_h = ·), ∀w_{h−1} ∈ A^ℓ × O^ℓ.    (3.6)

Intuitively, the density mapping X_h^{θ,π} maps an input trajectory w_{h−1} to the density of z_h, namely, the density of the k-step interactions following the input trajectory w_{h−1}. Similarly, we define the density mapping Y_h^{θ,π} : A^{ℓ+1} × O^{ℓ+1} → L^1(A^k × O^{k+1}) as

Y_h^{θ,π}(w_{h−1}, a_h, o_h) = P^{θ,π}(w_{h−1}, a_h, o_h, z_{h+1} = ·), ∀(w_{h−1}, a_h, o_h) ∈ A^{ℓ+1} × O^{ℓ+1}.    (3.7)

Based on the two density mappings defined in (3.6) and (3.7), respectively, we have the following identity for all h ∈ [H] and θ ∈ Θ,

B_h^θ(a_h, o_h) X_h^{θ,π}(w_{h−1}) = Y_h^{θ,π}(w_{h−1}, a_h, o_h), ∀(w_{h−1}, a_h, o_h) ∈ A^{ℓ+1} × O^{ℓ+1}.    (3.8)

See §D.2 for the proof of (3.8). We highlight that both sides of the identity in (3.8) are density mappings involving only observations and actions, and hence can be estimated from observable variables of the POMDP.
Upon fitting such density mappings, we can recover the Bellman operator B_h^{θ*}(a_h, o_h) by solving the identity in (3.8).

An Overview of Embedding Learning. We now summarize the learning procedure of the embedding. First, we estimate the density mappings defined in (3.6) and (3.7) under the true parameter θ* based on the interaction history. Second, we estimate the Bellman operators {B_h^{θ*}(a_h, o_h)}_{h∈[H]} based on the identity in (3.8) and the estimated density mappings from the first step. Finally, we recover the embedding Φ(τ_1^H) by assembling the Bellman operators according to Lemma 3.8.
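In a finite (tabular) instantiation, the identity in (3.8) becomes a linear system in the Bellman operator, which can be solved in the least-squares sense. The sketch below uses synthetic stand-ins for X and Y; it is an illustration of the recovery principle, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n_z, n_w = 6, 10    # sizes of the future (z_h) and past (w_{h-1}) trajectory spaces

# Hypothetical stand-ins: column w of X is the density X(w_{h-1}) over z_h,
# and B_true plays the role of B_h(a_h, o_h) for one fixed (a_h, o_h).
X = rng.random((n_z, n_w))
B_true = rng.random((n_z, n_z))
Y = B_true @ X                      # identity (3.8): B X(w) = Y(w, a_h, o_h)

# Recover the operator by solving the linear system in least squares.
B_hat = Y @ np.linalg.pinv(X)
```

When X has full row rank (enough informative pasts w_{h−1}), the least-squares solution recovers the operator exactly; otherwise it recovers its action on the span of the observed densities, which is what the identity (3.8) pins down.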

4. ALGORITHM

In what follows, we present Represent to Control (RTC), an online learning algorithm that iteratively learns the embedding and conducts control based on the learned embedding. In particular, RTC iteratively fits the density mappings defined in (3.6) and (3.7) with respect to the sampling policy, and fits the Bellman operators via the identity in (3.8). Finally, RTC conducts optimistic planning based on the confidence set identified during embedding learning. See §C for the detailed procedure and Algorithm 1 for a summary of RTC.

4.1. DENSITY ESTIMATION

For embedding learning, we need an estimator to recover the density mappings defined in (3.6) and (3.7). In practice, various approaches are available for fitting a density from observations. In what follows, we unify such density estimation approaches by a density estimation oracle.

Assumption 4.1 (Density Estimation Oracle). We assume that we have access to a density estimation oracle E(·). Moreover, for all δ > 0 and any dataset D of size n drawn from the density p following a martingale process, we assume that

∥E(D) − p∥_1 ≤ C · w_E · √(log(1/δ)/n)

with probability at least 1 − δ. Here C > 0 is an absolute constant and w_E is a parameter that depends on the density estimation oracle E(·).

We highlight that such a convergence property is achieved by various density estimators. In particular, when the function approximation space P of E(·) is finite, Assumption 4.1 holds for maximum likelihood estimation (MLE) and the generative adversarial approach with w_E = √(log |P|) (Geer et al., 2000; Zhang, 2006; Agarwal et al., 2020). Meanwhile, w_E scales with the entropy integral of P endowed with the Hellinger distance if P is infinite (Geer et al., 2000; Zhang, 2006). In addition, Assumption 4.1 holds for reproducing kernel Hilbert space (RKHS) density estimation (Gretton et al., 2005; Smola et al., 2007; Cai et al., 2022) with w_E = poly(d), where d is the rank of the low-rank transition (Cai et al., 2022).

Upon fitting the density mappings X_h^t and Y_h^t in the t-th iterate, we estimate the Bellman operators by minimizing the following objective,

L_h^t(θ) = sup_{a_{h−ℓ}^h ∈ A^{ℓ+1}} ∫_{O^{ℓ+1}} ∥B_h^θ(a_h, o_h) X_h^t(w_{h−1}) − Y_h^t(w_{h−1}, a_h, o_h)∥_1 do_{h−ℓ}^h.    (4.1)

Here recall that we define the shorthand w_{h−1} = {o_{h−ℓ}, a_{h−ℓ}, . . . , o_{h−1}, a_{h−1}} in (3.5).
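As a concrete instance of the oracle in Assumption 4.1, maximum likelihood estimation over a finite candidate class can be sketched as follows. The candidate class and the data-generating density are toy examples invented for this illustration.

```python
import numpy as np

def mle_oracle(data, candidates):
    """Maximum likelihood over a finite candidate class of densities:
    one possible instantiation of the oracle E (the class here is a toy)."""
    log_liks = [np.log(p[data]).sum() for p in candidates]
    return candidates[int(np.argmax(log_liks))]

rng = np.random.default_rng(3)
p_true = np.array([0.5, 0.3, 0.2])
candidates = [p_true,
              np.array([0.2, 0.3, 0.5]),
              np.ones(3) / 3]
data = rng.choice(3, size=2000, p=p_true)   # i.i.d. draws from p_true
p_hat = mle_oracle(data, candidates)
err = np.abs(p_hat - p_true).sum()          # L1 error, as in Assumption 4.1
```

With the true density in the class (realizability, as in Assumption 3.3) and enough samples, the MLE selects it with high probability, matching the √(1/n)-type guarantee the oracle abstracts.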

4.2. OPTIMISTIC PLANNING

Learning the Bellman operators allows us to identify a confidence set for the parameter and the associated embedding. In particular, we define the following confidence set,

C^t = {θ ∈ Θ : max{∥b_1^θ − b_1^t∥_1, L_h^t(θ)} ≤ β_t · √(1/t), ∀h ∈ [H]},    (4.2)

where β_t is the tuning parameter in the t-th iterate. To conduct optimistic planning, we seek the policy that maximizes the return over all parameters θ ∈ C^t and the corresponding embeddings. The policy update takes the following form,

π^t ← argmax_{π∈Π} max_{θ∈C^t} V^π(θ).

Here V^π(θ) is the cumulative reward estimated based on the embedding induced by θ. See §C for the details.
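The optimistic planning update π^t ← argmax_π max_{θ∈C^t} V^π(θ) reduces to a nested maximization once the policy class and confidence set are enumerable. The policies, parameters, and value table below are hypothetical toy data.

```python
# Schematic of the optimistic planning step: among all parameters theta in the
# confidence set C^t, credit each policy with its most favorable model, then
# pick the policy with the largest optimistic value.
policies = ["pi_a", "pi_b"]
confidence_set = ["theta_1", "theta_2"]
V = {("pi_a", "theta_1"): 1.0, ("pi_a", "theta_2"): 1.4,
     ("pi_b", "theta_1"): 1.2, ("pi_b", "theta_2"): 0.9}

# pi^t <- argmax_pi max_{theta in C^t} V^pi(theta): optimism under uncertainty.
pi_t = max(policies, key=lambda pi: max(V[(pi, th)] for th in confidence_set))
```

Here "pi_a" wins because its best-case value 1.4 exceeds the best case 1.2 of "pi_b"; optimism drives exploration toward models that could still be true.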

5. ANALYSIS

In what follows, we present the sample complexity analysis of RTC, as summarized in Algorithm 1. Our analysis hinges on the following assumptions.

Algorithm 1 Represent to Control
Require: Number of iterates T. A set of tuning parameters {β_t}_{t∈[T]}.
1: Initialization: Set π^0 as a deterministic policy. Set the dataset D_h^0(a_{h−ℓ}^{h+k}) as an empty set for all (h, a_{h−ℓ}^{h+k}) ∈ [H] × A^{k+ℓ+1}.
2: for t ∈ [T] do
3:   for (h, a_{h−ℓ}^{h+k}) ∈ [H] × A^{k+ℓ+1} do
4:     Start a new episode from the (1−ℓ)-th step.
5:     Execute the policy π^{t−1} until the (h−ℓ)-th step and receive the observations o_{1−ℓ}^{h−ℓ}.
6:     Execute the action sequence a_{h−ℓ}^{h+k} regardless of the observations and receive the observations o_{h−ℓ+1}^{h+k+1}.
7:     Update the dataset D_h^t(a_{h−ℓ}^{h+k}) ← D_h^{t−1}(a_{h−ℓ}^{h+k}) ∪ {o_{h−ℓ}^{h+k+1}}.
8:   end for
9:   Estimate the density of the trajectory P_h^t(· | a_{h−ℓ}^{h+k}) ← E(D_h^t(a_{h−ℓ}^{h+k})) for all h ∈ [H].
10:  Update the density mappings X_h^t and Y_h^t as follows,
       X_h^t(w_{h−1}) = P_h^t(w_{h−1}, z_h = ·),  Y_h^t(w_{h−1}, a_h, o_h) = P_h^t(w_{h−1}, a_h, o_h, z_{h+1} = ·).
11:  Update the initial density estimate b_1^t(τ_1^k) ← P^t(τ_1^k).
12:  Update the confidence set C^t by (4.2).
13:  Update the policy π^t ← argmax_{π∈Π} max_{θ∈C^t} V^π(θ).
14: end for
15: Output: policy set {π^t}_{t∈[T]}.

Assumption 5.1 (Bounded Pseudo-Inverse). We assume that ∥U_h^{θ,†}∥_{1→1} ≤ ν for all θ ∈ Θ and h ∈ [H], where ν > 0 is an absolute constant.

We remark that the upper bound on the pseudo-inverse in Assumption 5.1 quantifies the fundamental difficulty of solving the POMDP. In particular, the pseudo-inverse of the forward emission recovers the state density at the h-th step from the trajectory τ_h^{h+k} spanning the h-th to the (h+k)-th step. Thus, the upper bound ν on the pseudo-inverse operator characterizes how ill-conditioned the belief recovery task is based on the trajectory τ_h^{h+k}. In what follows, we impose a similar past sufficiency assumption.

Assumption 5.2 (Past Sufficiency). We define for all h ∈ [H], π ∈ Π, and θ ∈ Θ the reverse emission operator F_h^{θ,π} : R^d → L^1(O^ℓ × A^ℓ),

(F_h^{θ,π} v)(τ_{h−ℓ}^{h−1}) = ∑_{q_{h−1}∈[d]} [v]_{q_{h−1}} · P^{θ,π}(o_{h−ℓ}^{h−1} | q_{h−1}, a_{h−ℓ}^{h−1}), ∀v ∈ R^d,

where τ_{h−ℓ}^{h−1} ∈ A^ℓ × O^ℓ. We assume for some ℓ > 0 that the operator F_h^{θ,π} is left invertible for all h ∈ [H], π ∈ Π, and θ ∈ Θ. We denote by F_h^{θ,π,†} the left inverse of F_h^{θ,π}. We further assume that ∥F_h^{θ,π,†}∥_{1→1} ≤ γ for all h ∈ [H], π ∈ Π, and θ ∈ Θ, where γ > 0 is an absolute constant.

We remark that the left inverse F_h^{θ,π,†} of the reverse emission operator recovers the density of the bottleneck factor q_{h−1} from the density of the trajectory τ_{h−ℓ}^{h−1} spanning the (h−ℓ)-th to the (h−1)-th step. Intuitively, the past sufficiency assumption in Assumption 5.2 guarantees that the density of the trajectory τ_{h−ℓ}^{h−1} from the past captures sufficient information about the bottleneck factor q_{h−1}, which further determines the state distribution at the h-th step. Thus, similar to the upper bound ν in Assumption 5.1, the upper bound γ in Assumption 5.2 characterizes how ill-conditioned the belief recovery task is based on the trajectory τ_{h−ℓ}^{h−1} generated by the policy π.

In what follows, we analyze the mixture policy π̄^T of the policy set {π^t}_{t∈[T]} returned by RTC in Algorithm 1. In particular, the mixture policy π̄^T is executed by first sampling a policy π uniformly from the policy set {π^t}_{t∈[T]} at the beginning of an episode, and then executing π throughout the episode.

Theorem 5.3. Let π̄^T be the mixture policy of the policy set {π^t}_{t∈[T]} returned by Algorithm 1. Let

β_t = (ν + 1) · A^{2k} · w_E · (k + ℓ) · log(H · A · T) for all t ∈ [T], and
T = O(γ² · ν⁴ · d² · w_E² · H² · A^{2(2k+ℓ)} · (k + ℓ) · log(H · A/ϵ)/ϵ²).

Under Assumptions 3.1, 3.5, 4.1, 5.1, and 5.2, it holds with probability at least 1 − δ that π̄^T is ϵ-suboptimal.

Proof. See §E.3 for a detailed proof.

In Theorem 5.3, we fix the lengths k and ℓ of the future and past trajectories, respectively, such that Assumptions 3.5 and 5.2 hold. Theorem 5.3 shows that the mixture policy π̄^T of the policy set {π^t}_{t∈[T]} returned by RTC is ϵ-suboptimal if the number of iterations T scales as O(1/ϵ²). We remark that such a dependency on ϵ is information-theoretically optimal even for reinforcement learning in MDPs (Ayoub et al., 2020; Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021), which are a special case of POMDPs. In addition, the sample complexity T depends polynomially on the length H of the horizon, the number A of actions, the dimension d of the low-rank transition in Assumption 3.1, and the upper bounds ν and γ in Assumptions 5.1 and 5.2, respectively. We highlight that the sample complexity depends on the observation and state spaces only through the dimension d of the low-rank transition, extending the previous sample efficiency analyses of tabular POMDPs (Azizzadenesheli et al., 2016; Jin et al., 2020a).
In addition, the sample complexity depends on the upper bounds of the operator norms ν and γ in Assumptions 5.1 and 5.2, respectively, which quantify the fundamental difficulty of solving the POMDP. See §G for the analysis under the tabular POMDP setting.
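The mixture policy π̄^T analyzed in Theorem 5.3 is straightforward to execute; the following is a minimal sketch with placeholder policy labels in place of the learned policies {π^t}_{t∈[T]}.

```python
import random

def make_mixture_policy(policy_set, seed=0):
    """bar{pi}^T from Theorem 5.3: draw one policy uniformly at the start of
    each episode and follow it for the whole episode."""
    rng = random.Random(seed)
    def sample_for_episode():
        return rng.choice(policy_set)    # fixed for the entire episode
    return sample_for_episode

mixture = make_mixture_policy(["pi_1", "pi_2", "pi_3"])
episode_policy = mixture()               # the policy used throughout this episode
```

The key point is that the randomization happens once per episode, not per step, so the value of π̄^T is the average of the values of the policies in the set.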



Figure 1: Directed acyclic graph (DAG) of a POMDP with low-rank transition. Here {s_h, s_{h+1}}, {o_h, o_{h+1}}, a_h, r_h are the states, observations, action, and reward, respectively. In addition, we denote by q_h the bottleneck factor induced by the low-rank transition, which depends on the state-action pair (s_h, a_h) and determines the density of the next state s_{h+1}. In the DAG, we represent observable and unobservable variables by the shaded and unshaded nodes, respectively. In addition, we use the dashed node and arrows for the latent factor q_h and its corresponding transitions, respectively.



