MULTI-SKILL MOBILE MANIPULATION FOR OBJECT REARRANGEMENT

Abstract

We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location from which a stationary manipulation skill cannot reach its target. To this end, we propose that manipulation skills should include mobility, to gain flexibility in interacting with the target object from multiple locations, and that the navigation skill should admit multiple endpoints that lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and by training the navigation skill with a region goal instead of a point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance compared to the baselines. Our contributions are threefold:
1. We study how to formulate mobile manipulation skills, and empirically show that they are more robust to compounding errors in skill chaining than their stationary counterparts;
2. We devise a region-goal navigation reward for mobile manipulation, which shows better performance and stronger generalizability than the point-goal counterpart in previous works;
3. We show that our improved multi-skill mobile manipulation pipeline achieves superior performance on long-horizon mobile manipulation tasks without bells and whistles, and can serve as a strong baseline for future study.

1. INTRODUCTION

Building AI with embodiment is an important frontier of AI research. Object rearrangement (Batra et al., 2020) is considered a canonical task for embodied AI. The most challenging rearrangement tasks (Szot et al., 2021; Ehsani et al., 2021; Gan et al., 2021) are often long-horizon mobile manipulation tasks, which demand both navigation and manipulation abilities, e.g., moving to certain locations and picking or placing objects. Learning a monolithic RL policy for such complex long-horizon tasks is difficult due to high sample complexity, complicated reward design, and inefficient exploration. A practical solution is to decompose a long-horizon task into a set of subtasks that are tractable, short-horizon, and compact in state or action spaces. Each subtask can be solved by designing or learning a skill, so that a sequence of skills can be chained to complete the entire task (Lee et al., 2018; Clegg et al., 2018; Lee et al., 2019; 2021). For example, skills for object rearrangement can be picking or placing objects, opening or closing fridges and drawers, moving chairs, navigating in the room, etc.

Achieving successful object rearrangement with this modular framework requires careful subtask formulation, such that skills trained for these subtasks can be chained together effectively. We define three desirable properties for skills to solve diverse long-horizon tasks: achievability, composability, and reusability. We assume each subtask is associated with a set of initial states. Achievability quantifies the portion of initial states solvable by a skill. A pair of skills are composable if the initial states achievable by the succeeding skill encompass the terminal states of the preceding skill. This encompassment is necessary to ensure robustness to mild compounding errors.
However, trivially enlarging the initial set of a subtask increases learning difficulty and may introduce many initial states unachievable by the designed/learned skill. Last, a skill is reusable if it can be directly chained without fine-tuning or with only limited fine-tuning (Clegg et al., 2018; Lee et al., 2021). According to our experiments, effective subtask formulation is critical, though largely overlooked in the literature.

Figure 1: (a) An overview of our multi-skill mobile manipulation (M3) method. The inactive part of the robot is colored gray. Previous approaches exclusively activate either the mobile platform or the manipulator for each skill, and suffer from compounding errors in skill chaining given the limited composability of skills. We introduce mobility to manipulation skills, which effectively enlarges the feasible initial set, and a region-goal navigation reward to facilitate learning the navigation skill. (b) One task (SetTable) in the Home Assistant Benchmark (Szot et al., 2021), where the robot needs to navigate in the room, open the drawers or fridge, pick multiple objects from drawers or the fridge, and place them on the table. Best viewed in motion at the project website.

In the context of mobile manipulation, skill chaining poses many challenges for subtask formulation. For example, an imperfect navigation skill might terminate at a bad location where the target object is out of reach for a stationary manipulation skill (Szot et al., 2021). To tackle such "hand-off" problems, we investigate how to formulate subtasks for mobile manipulation. First, we replace stationary (fixed-base) manipulation skills with mobile counterparts, which allow the base to move while the manipulation is undertaken.
We observe that mobile manipulation skills are more robust to compounding errors in skill chaining, and enable the robot to make full use of its embodiment to better accomplish subtasks, e.g., finding a better location with less clutter and fewer obstacles from which to pick an object. We discuss how to generate initial states of manipulation skills as a trade-off between composability and achievability in Sec 4.1.

Second, we study how to translate the start of manipulation skills into the navigation reward used to train the navigation skill that connects manipulation skills. Note that the goal position in mobile manipulation plays a very different role from that in point-goal navigation (Wijmans et al., 2019; Kadian et al., 2020). On the one hand, the position of a target object (e.g., on the table or in the fridge) is often not directly navigable; on the other hand, a navigable position close to the goal position can still be infeasible due to kinematic and collision constraints. Besides, there exist multiple feasible starting positions for manipulation skills, yet previous works such as Szot et al. (2021) train the navigation skill to reach a single one, which is selected heuristically and may not be suitable for stationary manipulation. Thanks to the flexibility of our mobile manipulation skills, we devise a region-goal navigation reward to address these issues, detailed in Sec 4.2.

In this work, we present our improved multi-skill mobile manipulation method M3, where mobile manipulation skills are chained by the navigation skill trained with our region-goal navigation reward. It achieves an average success rate of 63% on 3 long-horizon mobile manipulation tasks in the Home Assistant Benchmark (Szot et al., 2021), compared to 50% for our best baseline. Fig 1 provides an overview of our method and tasks. Our contributions are threefold: 1) mobile manipulation skills that are more robust to compounding errors in skill chaining than stationary ones; 2) a region-goal navigation reward with better performance and generalizability than the point-goal counterpart; 3) an improved multi-skill pipeline, M3, that achieves superior performance on long-horizon mobile manipulation tasks and can serve as a strong baseline for future study.

2. RELATED WORK

2.1. MOBILE MANIPULATION

Rearrangement (Batra et al., 2020) is "to bring a given physical environment into a specified state". We refer readers to Batra et al. (2020) for a comprehensive survey. Many existing RL tasks can be considered instances of rearrangement, e.g., picking and placing rigid objects (Zhu et al., 2020; Yu et al., 2020) or manipulating articulated objects (Urakami et al., 2019; Mu et al., 2021). However, they mainly focus on stationary manipulation (Urakami et al., 2019; Zhu et al., 2020; Yu et al., 2020) or individual, short-horizon skills (Mu et al., 2021). Recently, several benchmarks, such as the Home Assistant Benchmark (HAB) (Szot et al., 2021), ManipulaTHOR (Ehsani et al., 2021), and the ThreeDWorld Transport Challenge (Gan et al., 2021), have been proposed to study long-horizon mobile manipulation tasks. They usually demand that the robot rearrange household objects in a room, requiring exploration and navigation (Anderson et al., 2018; Chaplot et al., 2020) between interactions with objects, entirely based on onboard sensing, without any privileged state or map information.

Mobile manipulation (RAS, 2022) refers to "robotic tasks that require a synergistic combination of navigation and interaction with the environment". It has long been studied in the robotics community. Ni et al. (2021) provides a summary of traditional methods, which usually require perfect knowledge of the environment. One example is task-and-motion planning (TAMP) (Srivastava et al., 2014; Garrett et al., 2021; 2020). TAMP relies on well-designed state proposals (grasp poses, robot positions, etc.) to sample feasible trajectories, which is computationally inefficient and does not scale to complicated scenarios. Learning-based approaches enable the robot to act according to visual observations. Xia et al.
(2021) proposes a hierarchical method for mobile manipulation in iGibson (Xia et al., 2020), which predicts either a high-level base or arm action with RL policies and executes plans generated by motion planning to achieve the action. However, the arm action space is specially designed for a primitive pushing action. Sun et al. (2022) develops a real-world RL framework to collect trash on the floor, with separate navigation and grasping policies. Ehsani et al. (2021); Ni et al. (2021) train an end-to-end RL policy to tackle mobile pick-and-place in ManipulaTHOR (Ehsani et al., 2021). However, the reward function used to train such an end-to-end policy usually demands careful tuning. For example, Ni et al. (2021) shows that a minor modification (a penalty for disturbance avoidance) can lead to a considerable performance drop. The vulnerability of end-to-end RL approaches restricts their scalability.

Most prior works in both RL and robotics separate the mobile platform and the manipulator, to "reduce the difficulty to solve the inverse kinematics problem of a kinematically redundant system" (Sereinig et al., 2020; Sandakalum & Ang Jr, 2022). Wang et al. (2020) trains an end-to-end RL policy based on the object pose and proprioception to simultaneously control the base and arm. It focuses on picking up a single object in simple scenes, while our work addresses long-horizon rearrangement tasks that require multiple skills.

Szot et al. (2021) adopts a different hierarchical approach for mobile manipulation. It uses task planning (Fikes & Nilsson, 1971) to generate high-level symbolic goals, and individual skills are trained by RL to accomplish those goals. It outperforms both the monolithic end-to-end RL policy and the classical sense-plan-act robotic pipeline. It is scalable since skills can be composed to solve different tasks, and it benefits from progress in individual skill learning (Yu et al., 2020; Mu et al., 2021).
Moreover, different from other benchmarks, the HAB features continuous motor control (base and arm), interaction with articulated objects (opening drawers and fridges), and complicated scene layouts. Thus, we choose the HAB as the platform to study long-horizon mobile manipulation.

2.2. SKILL CHAINING FOR LONG-HORIZON TASKS

Szot et al. (2021) observes that sequentially chaining multiple skills suffers from "hand-off" problems, where a preceding skill terminates at a state that the succeeding skill has never seen during training or cannot solve. Lee et al. (2018) proposes to learn a transition policy to connect primitive skills, but assumes that such a policy can be found through random exploration. Lee et al. (2021) regularizes the terminal state distribution of a skill to be close to the initial set of the following skill, through a reward learned with adversarial training. Most prior skill chaining methods focus on fine-tuning learned skills. In this work, we instead focus on subtask formulation for skill chaining, which directly improves composability and reusability without additional computation.

3. PRELIMINARY

3.1. HOME ASSISTANT BENCHMARK (HAB)

The Home Assistant Benchmark (HAB) (Szot et al., 2021) includes 3 long-horizon mobile manipulation rearrangement tasks (TidyHouse, PrepareGroceries, SetTable) based on the ReplicaCAD dataset, which contains a rich set of 105 indoor scene layouts. For each episode (instance of a task), rigid objects from the YCB dataset (Calli et al., 2015) are randomly placed on annotated supporting surfaces of receptacles, to generate clutter in a randomly selected scene. Here we provide a brief description of these tasks. All the tasks demand onboard sensing instead of privileged information (e.g., ground-truth object positions and a navigation map). All the tasks use the GeometricGoal specification (Batra et al., 2020) $(s_0, s_*)$, which describes the initial 3D (center-of-mass) position $s_0$ of the target object and the goal position $s_*$. For example, TidyHouse is specified by 5 tuples $\{(s_0^i, s_*^i)\}_{i=1\ldots5}$.
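For concreteness, a GeometricGoal specification can be represented as plain $(s_0, s_*)$ pairs. The container class below is a hypothetical illustration, not part of the HAB codebase, and the coordinates are made up:

```python
import numpy as np

# Hypothetical container for a GeometricGoal specification (s_0, s_*): each
# target object is described only by its initial center-of-mass position s_0
# and its goal position s_*, both 3D points.
class GeometricGoal:
    def __init__(self, s_0, s_star):
        self.s_0 = np.asarray(s_0, dtype=np.float64)      # initial object position
        self.s_star = np.asarray(s_star, dtype=np.float64)  # goal position

# TidyHouse would be specified by 5 such tuples, one per target object
# (two shown here with made-up coordinates).
tidy_house_spec = [
    GeometricGoal([1.0, 0.8, -2.0], [0.2, 0.9, 1.5]),
    GeometricGoal([-0.5, 0.4, 0.3], [2.1, 0.8, -1.0]),
]
```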

3.2. SUBTASK AND SKILL

In this section, we present the definitions of subtask and skill in the context of reinforcement learning. A long-horizon task can be formulated as a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, I)$ of state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R(s, a, s')$, transition distribution $P(s'|s, a)$, and initial state distribution $I$. A subtask $\omega$ is a smaller MDP $(\mathcal{S}, \mathcal{A}_\omega, R_\omega, P, I_\omega)$ derived from the original MDP of the full task. A skill (or policy), which maps a state $s \in \mathcal{S}$ to an action $a \in \mathcal{A}$, is learned for each subtask by RL algorithms. Note that $s_0$ is constant per episode rather than a tracked object position. Hence, the target object may not be located at $s_0$ at the beginning of a skill, e.g., when picking an object from an opened drawer. Next, we will illustrate how these skills are chained in the HAB.

3.3. SKILL CHAINING

Given a task decomposition, a hierarchical approach also needs to generate high-level actions that select a subtask and perform the corresponding skill. Task planning (Fikes & Nilsson, 1971) can be applied to find a sequence of subtasks before execution, given perfect knowledge of the environment. An alternative is to learn high-level actions through hierarchical RL. In this work, we use the subtask sequences generated by a perfect task planner (Szot et al., 2021). Here we list these sequences to highlight the difficulty of the tasks.
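To make the chaining mechanics concrete, here is a minimal sketch of executing a planner-generated subtask sequence by running one skill per subtask. The environment/skill interfaces and the simplified SetTable-style sequence are hypothetical, not the HAB implementation:

```python
# Minimal sketch of chaining skills along a planner-generated subtask sequence;
# `env`, `skills`, and the sequence format are hypothetical, not the HAB code.
def execute_task(env, skills, subtask_sequence, max_steps=200):
    """Run the skill for each subtask until it terminates or times out."""
    obs = env.reset()
    for subtask, target in subtask_sequence:
        skill = skills[subtask]
        for _ in range(max_steps):
            obs = env.step(skill.act(obs, target))
            if skill.done(obs, target):
                break
        else:
            return False  # skill timed out: a hand-off failure
    return True

# An illustrative (simplified) sequence in the spirit of SetTable:
set_table_sequence = [
    ("Navigate", "drawer"), ("Open", "drawer"), ("Pick", "bowl"),
    ("Navigate", "table"), ("Place", "bowl"),
]
```

Note that the completion of each subtask is conditioned on its predecessor, which is exactly why compounding errors at the hand-off points matter.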

4. SUBTASK FORMULATION AND SKILL LEARNING FOR MOBILE MANIPULATION

Following the proposed principles (composability, achievability, reusability), we revisit and reformulate the subtasks defined in the Home Assistant Benchmark (HAB). The core idea is to enlarge the initial states of manipulation skills to encompass the terminal states of the navigation skill, given our observation that the navigation skill is usually more robust to initial states. However, the manipulation skills (Pick, Place, Open drawer, Close drawer) in Szot et al. (2021) are stationary. The composability of a stationary manipulation skill is restricted, since its feasible initial states are limited by kinematic constraints. For instance, the robot cannot open the drawer if it is too close to or too far from the drawer. Therefore, these initial states need to be carefully designed given the trade-off between composability and achievability, which is neither scalable nor flexible. Moreover, the navigation skill, which is learned to navigate to the start of manipulation skills, is also restricted by stationary constraints, since it is required to terminate precisely within a small set of "good" locations for manipulation.

To this end, we propose to replace stationary manipulation skills with mobile counterparts. Thanks to mobility, mobile manipulation skills can have better composability without sacrificing much achievability. For example, a mobile manipulator can learn to first get closer to the target and then manipulate, to compensate for errors from navigation. This means the initial states can be designed more flexibly, which also enables us to design a better navigation reward to facilitate learning.

In the context of mobile manipulation, the initial state of a skill consists of the robot base position, base orientation, and joint positions. For simplicity, we do not discuss the initial states of rigid and articulated objects in the scene, which are usually defined during episode generation.
Moreover, we follow previous works (Szot et al., 2021; Lee et al., 2021) to initialize the arm at its resting position and reset it after each skill in skill chaining. Such a reset operation is common in robotics (Garrett et al., 2020). Each skill is learned to reset the arm after accomplishing the subtask, as in Szot et al. (2021). Furthermore, for base orientation, we follow the heuristic in Szot et al. (2021) to make the robot face the target position $s_0$ or $s_*$.

4.1. MANIPULATION SKILLS WITH MOBILITY

We first present how initial base positions are generated in previous works. For stationary manipulation, a feasible base position needs to satisfy several constraints, e.g., kinematic (the target is reachable) and collision-free constraints. Szot et al. (2021) uses heuristics to determine base positions. For Pick and Place without containers (fridge and drawer), the navigable position closest to the target position is selected. For Pick and Place with containers, a fixed position relative to the container is selected. For Open and Close, a navigable position is randomly selected from a handcrafted region relative to each container. Noise is additionally applied to the base position and orientation, and infeasible initial states are rejected by the constraints. See Fig 2 for examples. These examples indicate the difficulty and complexity of designing feasible initial states for stationary manipulation. One naive solution is to enlarge the initial set with infeasible states, but this can hurt learning, as shown later in Sec 5.4. Besides, rejection sampling can be quite inefficient in this case, and Szot et al. (2021) actually computes a fixed number of feasible initial states offline.

Manipulation Skills with Mobility. To this end, we propose to use mobile manipulation skills instead. The original action space (only arm actions) is augmented with base actions. We devise a unified and efficient pipeline to generate initial base positions. Concretely, we first discretize the floor map with a resolution of 5 × 5 cm², and get all navigable (grid) positions. Then, different candidates are computed from these positions based on the subtasks. Candidates are either within a radius (e.g., 2m) around the target position for Pick and Place, or within a region relative to the container for Open and Close.
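The candidate-generation pipeline above can be sketched as follows. This is a minimal sketch assuming a boolean navigability grid at the 5 cm resolution; the function name and grid convention are ours, not the HAB implementation:

```python
import numpy as np

# Sketch of the unified candidate-generation pipeline: discretize the floor
# map, keep navigable cells, then select candidates within a radius of the
# target (the radius-based case used for Pick / Place).
def candidate_base_positions(navigable, target_xy, radius=2.0, res=0.05):
    """Return navigable grid cells (in meters) within `radius` of the target."""
    ys, xs = np.nonzero(navigable)                  # all navigable grid cells
    pts = np.stack([xs, ys], axis=1) * res          # grid indices -> meters
    dist = np.linalg.norm(pts - np.asarray(target_xy), axis=1)
    return pts[dist <= radius]                      # candidate base positions
```

The region-relative case for Open and Close would filter the same navigable cells with a container-relative region test instead of a radius test.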

4.2. NAVIGATION SKILL WITH REGION-GOAL NAVIGATION REWARD

The navigation skill is learned to connect different manipulation skills. Hence, it needs to terminate within the set of achievable initial states of the manipulation skills. We follow Szot et al. (2021) to randomly sample a navigable base position and orientation as the initial state of the navigation skill. The challenge is how to formulate the reward function, which implicitly defines desirable terminal states. A common navigation reward (Wijmans et al., 2019) is the negative change of geodesic distance to a single 2D goal position on the floor. Szot et al. (2021) extends it for mobile manipulation by introducing the negative change of angular distance to the desired orientation (facing the target). The resulting reward function for state $s$ and action $a$ is:

$$r_t(s, a) = -\Delta_{geo}(g) - \lambda_{ang}\,\Delta_{ang}\,\mathbb{I}[d^{geo}_t(g) \le D] + \lambda_{succ}\,\mathbb{I}[d^{geo}_t(g) \le D \wedge d^{ang}_t \le \Theta] - r_{slack} \quad (1)$$

This reward has several drawbacks: 1) A single 2D goal needs to be assigned, which should be an initial base position of the manipulation skills. It is usually sampled with rejection, as explained in Sec 4.1. It ignores the existence of multiple reasonable goals, introduces ambiguity into the reward (hindering training), and leads the skill to memorize (hurting generalization). 2) There is a hyperparameter $D$, which defines the region where the angular term $\Delta_{ang}$ is considered. However, it can lead the agent to learn the undesirable behavior of entering the region with a large angular distance, e.g., backing onto the target.

Region-Goal Navigation Reward. To this end, we propose a region-goal navigation reward for training the navigation skill. Inspired by object-goal navigation, we use the geodesic distance between the robot and a region of 2D goals on the floor instead of a single goal. Thanks to the flexibility of our mobile manipulation skills, we can simply reuse the candidates (Sec 4.1) for their initial base positions as the navigation goals. However, these candidates are not all collision-free.
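The region-goal progress term, i.e., the change of distance between the robot and the nearest goal in a region $\{g\}$, can be sketched as follows. This is a minimal sketch with Euclidean distance standing in for geodesic distance on the navigation mesh; the function names are ours:

```python
import numpy as np

# Sketch of the region-goal progress term: distance to the NEAREST goal in a
# region {g}, and the per-step progress reward derived from it.
def region_distance(xy, goal_region):
    """Distance from the robot to the closest goal in the region."""
    return min(np.linalg.norm(np.asarray(xy) - np.asarray(g)) for g in goal_region)

def region_progress_reward(prev_xy, cur_xy, goal_region):
    """Positive when the robot moves closer to the goal region."""
    return region_distance(prev_xy, goal_region) - region_distance(cur_xy, goal_region)
```

Because progress is measured against the nearest of many goals, the reward does not force the policy toward one heuristically chosen endpoint.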
Thus, we add a collision penalty $r_{col} = \lambda_{col} C_t$ to the reward, where $C_t$ is the current collision force and $\lambda_{col}$ is a weight. Besides, we simply remove the angular term, and find that the success reward is sufficient to encourage correct orientation. Our region-goal navigation reward is as follows:

$$r_t(s, a) = -\Delta_{geo}(\{g\}) + \lambda_{succ}\,\mathbb{I}[d^{geo}_t(\{g\}) \le D \wedge d^{ang}_t \le \Theta] - r_{col} - r_{slack} \quad (2)$$

5. EXPERIMENTS

5.1. EXPERIMENTAL SETUP

We use the ReplicaCAD dataset and the Habitat 2.0 simulator (Szot et al., 2021) for our experiments. The ReplicaCAD dataset contains 5 macro variations, with 21 micro variations per macro variation. We hold out 1 macro variation to evaluate generalization to unseen layouts. For the remaining 4 macro variations, we split the 84 scenes into 64 scenes for training and 20 scenes for evaluating generalization to unseen configurations (object and goal positions). For each task, we generate 6400 episodes (64 scenes) for training, 100 episodes (20 scenes) to evaluate cross-configuration generalization, and another 100 episodes (the held-out macro variation) to evaluate cross-layout generalization. The robot is a Fetch (Robotics, 2022) mobile manipulator with a 7-DoF arm and a parallel-jaw gripper. See Appendix B for more details about the setup and dataset generation.

Observation space: The observation space includes head and arm depth images (128 × 128), arm joint positions (7-dim), the end-effector position (3-dim) in the base frame, goal positions (3-dim) in both the base and end-effector frames, as well as a scalar indicating whether an object is held. The goal position, depending on the subtask, can be either the initial or the desired position of the target object. We assume a perfect GPS+Compass sensor and proprioceptive sensors, as in Szot et al. (2021), which are used to compute the relative goal positions. For the navigation skill, only the head depth image and the goal position in the base frame are used.

Action space: The action space is a 10-dim continuous space, including a 2-dim base action (linear forward and angular velocities), a 7-dim arm action, and a 1-dim gripper action. Grasping is abstracted, as in Batra et al. (2020); Szot et al. (2021); Ehsani et al. (2021): if the gripper action is positive, the object closest to the end-effector within 15cm will be snapped to the gripper; if negative, the gripper will release any held object.
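The abstracted grasping rule can be sketched as follows; this is a hypothetical rendering of the snap/release logic described above, and the names are ours:

```python
import numpy as np

# Sketch of abstracted grasping: a positive gripper action snaps the nearest
# object within 15cm of the end-effector; a negative action releases any held
# object. Returns the index of the held object, or None if nothing is held.
def abstract_grasp(gripper_action, ee_pos, object_positions, held, snap_dist=0.15):
    if gripper_action > 0 and held is None and len(object_positions) > 0:
        dists = [np.linalg.norm(np.asarray(p) - ee_pos) for p in object_positions]
        i = int(np.argmin(dists))
        if dists[i] <= snap_dist:
            return i            # snap object i to the gripper
    elif gripper_action < 0:
        return None             # release any held object
    return held                 # otherwise, the held state is unchanged
```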
For the navigation skill, we use a discrete action space, including a stop action, as in Yokoyama et al. (2021); Szot et al. (2021). A discrete action is converted to continuous velocities to move the robot, while arm and gripper actions are masked out.

Hyper-parameters: We train each skill with the PPO (Schulman et al., 2017) algorithm. The visual observations are encoded by a 3-layer CNN, as in Szot et al. (2021). The visual features are concatenated with state observations and the previous action, followed by a 1-layer GRU and linear layers to output the action and value. Each skill is trained with 3 different seeds. See Appendix C.1 for details.

Metrics: Each HAB task consists of a sequence of subtasks to accomplish, as illustrated in Sec 3.3. The completion of a subtask is conditioned on the completion of its preceding subtask. We report progressive completion rates of subtasks; the completion rate of the last subtask is thus the success rate of the full task. For each evaluation episode, the robot is initialized at a random base position and orientation without collision, and its arm is initialized at the resting position. The completion rate is averaged over 9 different runs.
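The discrete-to-continuous conversion for the navigation action can be sketched as follows, assuming the discretization described in Appendix C (4 linear × 5 angular choices, scaled by 3); the index ordering is our own convention:

```python
# Sketch of mapping a discrete navigation action index to scaled (linear,
# angular) velocities; the flat index ordering is our own convention.
LINEAR = [-0.5, 0.0, 0.5, 1.0]           # 4 normalized linear velocities
ANGULAR = [-1.0, -0.5, 0.0, 0.5, 1.0]    # 5 normalized angular velocities

def discrete_to_velocity(action_idx, scale=3.0):
    """action_idx in [0, 20); returns scaled (linear, angular) velocities."""
    lin = LINEAR[action_idx // len(ANGULAR)]
    ang = ANGULAR[action_idx % len(ANGULAR)]
    return scale * lin, scale * ang

# The stop action is the index whose velocities are both zero:
STOP = LINEAR.index(0.0) * len(ANGULAR) + ANGULAR.index(0.0)
```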

5.2. BASELINES

We denote our method by M3, short for a multi-skill mobile manipulation pipeline where mobile manipulation skills (M) are chained by the navigation skill trained with our region-goal navigation reward (R). We compare our method with several RL baselines. All baselines follow the same experimental setup as in Sec 5.1 unless otherwise specified; we refer readers to Szot et al. (2021) for further details. Monolithic RL (mono): a monolithic end-to-end RL policy trained on the full task. Stationary manipulation skills + point-goal navigation reward (S+P): the modular pipeline of Szot et al. (2021), which chains stationary manipulation skills with a navigation skill trained with the point-goal navigation reward. Mobile manipulation skills + point-goal navigation reward (M+P): compared to our M3, this baseline does not use the region-goal navigation reward. It demonstrates the effectiveness of the proposed mobile manipulation skills. Note that the point-goal navigation reward is designed for the start of stationary manipulation skills.

5.3. RESULTS

Fig 3 shows the progressive completion rates of different methods on all tasks. Our method M3 achieves an average success rate of 71.2% in the cross-configuration setting and 55.0% in the cross-layout setting over all 3 tasks. It outperforms all the baselines in both settings, namely mono (1.8%/1.8%), S+P (57.4%/31.1%), and M+P (64.9%/36.2%). First, all the modular approaches show much better performance than the monolithic baseline, which verifies the effectiveness of modular approaches for long-horizon mobile manipulation tasks. Mobile manipulation skills are in general superior to stationary ones (M+P vs. S+P). Fig 4 provides an example where mobile manipulation skills can compensate for imperfect navigation. Furthermore, our region-goal navigation reward reduces the ambiguity of navigation goals to facilitate training (see training curves in Appendix C). Since it does not require the policy to memorize ambiguous goals, the induced skill shows better generalizability, especially in the cross-layout setting (55.0% for M3 vs. 36.2% for M+P).

5.4. ABLATION STUDIES

We conduct several ablation studies to show that mobile manipulation skills are more flexible to formulate than stationary ones, and to understand the advantage of our navigation reward.

Can initial states be trivially enlarged? We conduct experiments to understand to what extent we can enlarge the initial states of manipulation skills given the trade-off between achievability and composability. In the S(L)+P experiment, we simply replace the initial states of stationary manipulation skills with those of mobile ones. The success rates of stationary manipulation skills on subtasks drop by a large margin, e.g., from 95% to 45% for Pick on TidyHouse. Fig 5 shows that S(L)+P (37.7%/18.1%) is inferior to both S+P (57.4%/31.1%) and M+P (64.9%/36.2%). This indicates that stationary manipulation skills have a much smaller set of feasible initial states compared to mobile ones, and that including infeasible initial states during training can hurt performance significantly. We also study the impact of the initial state distribution on mobile manipulation skills in Appendix F.

Is the collision penalty important for the navigation skill? Our region-goal navigation reward benefits from unambiguous region goals and the collision penalty. We add the collision penalty to the point-goal navigation reward (Eq 1) in the S+P(C) and M+P(C) experiments. The collision penalty improves performance, e.g., M+P(C) (67.9%/49.2%) vs. M+P (64.9%/36.2%). A collision-aware navigation skill can avoid disturbing the environment, e.g., accidentally closing the fridge before placing an object in it. Besides, M+P(C) is still inferior to our M3 (71.2%/55.0%), which implies that reducing the ambiguity of navigation goals helps learn more robust and generalizable navigation skills.

6. CONCLUSION AND LIMITATIONS

In this work, we present a modular approach to tackle long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), featuring mobile manipulation skills and a region-goal navigation reward. Given its superior performance, our approach can serve as a strong baseline for future study. Besides, the proposed principles (achievability, composability, reusability) can serve as a guideline for formulating meaningful and reusable subtasks. However, our work is still limited by abstracted grasping and other potential simulation defects. We leave fully dynamic simulation and real-world deployment to future work.

A OVERVIEW

Compared to the original implementation (Szot et al., 2021), our implementation benefits from repaired assets (Sec B), improved reward functions, and better training schemes (Sec C). Other differences include the observation and action spaces. We add to the observations the target positions in the base frame, in addition to those in the end-effector frame. The arm action is defined in the joint configuration space (7-dim) rather than the end-effector Euclidean space (3-dim with no orientation).

B DATASET AND EPISODES

Szot et al. (2021) keeps updating the ReplicaCAD dataset. The major fix is "minor furniture layout modifications in order to better accommodate robot access to the full set of receptacles". The agent radius is also decreased from 0.4m to 0.3m to generate navigation meshes with higher connectivity. Besides, Szot et al. (2021) also improves the episode generator to ensure stable initialization of objects. Those improvements eliminate most unachievable episodes in the initial version. The episodes used in our experiments are generated with ReplicaCAD v1.4 and the latest habitat-lab.

The cross-configuration and cross-layout settings are the same except for scene layouts. In the cross-configuration setting, test scene layouts (micro variations) are different from but similar to training ones. In the cross-layout setting, test scene layouts (macro variations) are significantly different from training ones. Each macro variation has a different, semantically plausible layout of large furniture (e.g., kitchen counter and fridge), while each micro variation is generated by perturbing small furniture (e.g., chairs and tables). Thus, the cross-layout setting demands stronger generalization over scene layouts.

For TidyHouse, each episode includes 20 clutter objects and 5 target objects along with their goal positions, located at 7 different receptacles (chair, 2 tables, tv stand, two kitchen counters, sofa).
For PrepareGroceries, each episode includes 21 clutter objects located at 8 different receptacles (the 7 receptacles used in TidyHouse plus the top shelf of the fridge) and 1 clutter object located at the middle shelf of the fridge. 2 target objects are located at the middle shelf, and each of their goal positions is located at one of the two kitchen counters. The third target object is located at one of the two kitchen counters, and its goal position is at the middle shelf. SetTable generates episodes similarly to PrepareGroceries, except that the two target objects, a bowl and an apple, are initialized in one of 3 drawers and at the middle fridge shelf, respectively. Each of their goal positions is located at one of the two tables.

C SKILL LEARNING

Each skill is trained to accomplish a subtask and reset its end-effector at the resting position. The robot arm is first initialized with predefined resting joint positions, such that the corresponding resting position of the end-effector is (0.5, 1.0, 0.0) in the base frame. The initial end-effector position is then perturbed by a Gaussian noise $\mathcal{N}(0, 0.025)$ clipped at 0.05m. The base position is perturbed by a Gaussian noise $\mathcal{N}(0, 0.1)$ truncated at 0.2m. The base orientation is perturbed by a Gaussian noise $\mathcal{N}(0, 0.25)$ truncated at 0.5 radian. The maximum episode length is 200 steps for all the manipulation skills, and 500 steps for the navigation skill. The episode terminates on success or failure. We use the same reward function for both stationary and mobile manipulation skills, unless specified.

For all skills, $d^o_{ee}$ is the distance between the end-effector and the object, $d^r_{ee}$ is the distance between the end-effector and the resting position, $d^h_{ee}$ is the distance between the end-effector and a predefined manipulation handle (a 3D position) of the articulated object, and $d^g_a$ is the distance between the joint position of the articulated object and the goal joint position. $\Delta^b_a = d^b_a(t-1) - d^b_a(t)$ stands for the (negative) change in distance between $a$ and $b$. For example, $\Delta^o_{ee}$ is the change in distance between the end-effector and the object. $\mathbb{I}_{holding}$ indicates whether the robot is holding the (correct) object or handle. $\mathbb{I}_{succ}$ indicates task success. $C_t$ refers to the current collision force, and $C_{1:t}$ stands for the accumulated collision force.

The 7-dim arm action stands for the delta joint positions added to the current target joint positions of the PD controller. The input arm action is assumed to be normalized to [-1, 1], and is scaled by 0.025 (radian). The 2-dim base action stands for linear and angular velocities. Base movement in Habitat 2.0 is implemented by kinematically setting the robot's base transformation.
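The change-in-distance shaping terms $\Delta^b_a = d^b_a(t-1) - d^b_a(t)$ used by all skill rewards can be tracked with a small stateful helper (a sketch; the class name is ours):

```python
# Sketch of tracking a shaping term Delta = d(t-1) - d(t): positive when the
# tracked distance decreases (progress), zero on the first step.
class DeltaTracker:
    """Tracks d(t-1) and returns the (negative) change d(t-1) - d(t)."""
    def __init__(self):
        self.prev = None

    def update(self, d):
        delta = 0.0 if self.prev is None else self.prev - d
        self.prev = d
        return delta
```

A skill reward would keep one tracker per distance term (e.g., end-effector-to-object and end-effector-to-resting-position) and combine their outputs each step.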
Collision between the robot base and the navigation meshes is taken into consideration. The input base action is assumed to be normalized to [-1, 1], and is scaled by 3 (navigation skill) or 1.5 (manipulation skills). For the navigation skill, we follow Szot et al. (2021) to use a discrete action space and translate each discrete action into a continuous one. Concretely, the (normalized) linear velocity from -0.5 to 1 is discretized into 4 choices ({-0.5, 0, 0.5, 1}), and the (normalized) angular velocity from -1 to 1 is discretized into 5 choices ({-1, -0.5, 0, 0.5, 1}). The stop action corresponds to the discrete action representing zero velocities.

Pick(s_0)

• Objective: pick the object initialized at s_0.
• Initial base position (noise is applied in addition):
  - Stationary: the closest navigable position to s_0.
  - Mobile: a randomly selected navigable position within 2m of s_0.
• Reward: I_pick indicates whether the correct object is picked and I_wrong indicates whether a wrong object is picked.
  r_t = 4Δ^o_ee I_!holding + I_pick + 4Δ^r_ee I_holding + 2.5 I_succ - min(0.001 C_t, 0.2) - I_[C_1:t > 5000] - I_wrong - I_[d^o_ee > 0.09] I_holding - 0.002
• Success: the robot is holding the target object and the end-effector is within 5cm of the resting position. I_succ = I_holding ∧ d^r_ee ≤ 0.05
• Failure:
  - I_[C_1:t > 5000] = 1: the accumulated collision force is larger than 5000N.
  - I_wrong = 1: a wrong object is picked.
  - I_[d^o_ee > 0.09] I_holding = 1: the held object slides off the gripper.
• Observation space:
  - Depth images from the head and arm cameras.
  - The current arm joint positions.
  - The current end-effector position in the base frame.
  - Whether the gripper is holding anything.
  - The starting position s_0 in both the base and end-effector frames.
• Action space: the gripper is disabled to release.
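The Pick reward above can be sketched as a plain function of the quantities defined in this appendix (a hedged sketch; all argument names are our own, indicators are passed in as booleans, and the collision penalty is capped at 0.2 as in the other skills):

```python
def pick_reward(d_ee_o_delta, d_ee_r_delta, holding, correct_picked, succ,
                collision_force, collision_accum, wrong_obj, obj_slipped):
    """Sketch of the Pick reward. d_ee_o_delta / d_ee_r_delta are the per-step
    decreases in end-effector-to-object and end-effector-to-resting distance."""
    r = 0.0
    r += 4.0 * d_ee_o_delta * (not holding)   # approach the object before grasping
    r += 1.0 * correct_picked                 # bonus for picking the correct object
    r += 4.0 * d_ee_r_delta * holding         # return to resting position after grasp
    r += 2.5 * succ
    r -= min(0.001 * collision_force, 0.2)    # capped collision penalty
    r -= 1.0 * (collision_accum > 5000)       # failure: accumulated collision force
    r -= 1.0 * wrong_obj                      # failure: wrong object picked
    r -= 1.0 * obj_slipped                    # failure: object slides off the gripper
    r -= 0.002                                # slack penalty
    return r
```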
Place(s*)

• Objective: place the held object at s*.
• Initial base position (noise is applied in addition):
  - Stationary: the closest navigable position to s*.
  - Mobile: a randomly selected navigable position within 2m of s*.
• Reward: I_place indicates whether the object is released within 15cm of the goal position, and I_drop indicates whether the object is released beyond 15cm.
  r_t = 4Δ^s*_o I_holding + I_place + 4Δ^r_ee I_!holding + 2.5 I_succ - min(0.001 C_t, 0.2) - I_[C_1:t > 7500] - I_drop - I_[d^o_ee > 0.09] I_holding - 0.002
• Success: the object is within 15cm of the goal position and the end-effector is within 5cm of the resting position. I_succ = d^s*_o ≤ 0.15 ∧ I_!holding ∧ d^r_ee ≤ 0.05
• Failure:
  - I_[C_1:t > 7500] = 1: the accumulated collision force is larger than 7500N.
  - I_drop = 1: the object is released beyond 15cm of the goal position.
  - I_[d^o_ee > 0.09] I_holding = 1: the held object slides off the gripper.
• Observation space:
  - Depth images from the head and arm cameras.
  - The current arm joint positions.
  - The current end-effector position in the base frame.
  - Whether the gripper is holding anything.
  - The goal position s* in both the base and end-effector frames.
• Action space: the gripper is disabled to grasp after releasing the object.

Open drawer(s)

• Objective: open the drawer containing the object initialized at s. The goal joint position of the drawer is g = 0.45m.
• Observation space:
  - Depth images from the head and arm cameras.
  - The current arm joint positions.
  - The current end-effector position in the base frame.
  - Whether the gripper is holding anything.
  - The starting position s in both the base and end-effector frames.

Close drawer(s)

• Objective: close the drawer containing the object initialized at s. The goal joint position is g = 0m.
• Initial joint position: q_a ∈ [0.4, 0.5], where q_a is the joint position of the target drawer. A random subset of the other drawers are slightly open (q'_a ≤ 0.1).

Open fridge(s)

• Objective: open the fridge containing the object initialized at s. The goal joint position is g = π/2.
• Initial base position (noise is applied in addition): a navigable position randomly selected within a [0.933, -1.5] × [1.833, 1.5] region in front of the fridge.
• Reward: I_open = (g - q_a > 0.15), where q_a is the joint position (radian) of the fridge. To avoid the robot penetrating the fridge due to simulation defects, we add a collision penalty, but exclude collisions between the end-effector and the fridge.
• Observation space:
  - Depth images from the head and arm cameras.
  - The current arm joint positions.
  - The current end-effector position in the base frame.
  - Whether the gripper is holding anything.
  - The starting position s in both the base and end-effector frames.

Close fridge(s)

• Objective: close the fridge containing the object initialized at s. The goal joint position is g = 0.
• Initial joint position: q_a ∈ [π/2 - 0.15, 2.356], where q_a is the joint position of the target fridge.

Navigate(s) (point-goal)

• Objective: navigate to the start of other skills, specified by s.
• Reward: refer to Eq 1. r_slack = 0.002, D = 0.9, λ_ang = 0.25, λ_succ = 2.5.
• Success: the robot is within 0.3 meter of the goal and 0.5 radian of the target orientation, and has called the stop action at the current time step.
• Observation space:
  - Depth images from the head camera.
  - The goal position s* in the base frame.
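The point-goal success criterion above can be sketched as follows (thresholds from the text; we use Euclidean distance and a wrapped angle difference as stand-ins for the simulator's distance checks; names are our own):

```python
import numpy as np

def pointgoal_success(agent_pos, goal_pos, agent_yaw, goal_yaw, called_stop,
                      dist_thresh=0.3, ang_thresh=0.5):
    """Sketch of the point-goal navigation success check."""
    dist_ok = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos)) <= dist_thresh
    # Smallest signed angle difference, wrapped to [-pi, pi].
    ang = (agent_yaw - goal_yaw + np.pi) % (2 * np.pi) - np.pi
    return bool(dist_ok and abs(ang) <= ang_thresh and called_stop)
```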

D MONOLITHIC BASELINE

For the monolithic baseline, a monolithic RL policy is trained for each HAB task. During training, the policy only handles one randomly selected target object, e.g., picking and placing one object in TidyHouse. During inference, the policy is applied to each target object. We use the same observation space, action space, and training scheme as those for our mobile manipulation skills. The main challenge is how to formulate a reward function for these complicated long-horizon HAB tasks, which usually require multiple stages. We follow Szot et al. (2021) to compose the reward functions of individual skills, given the sequence of subtasks. Concretely, at each time step during training, we infer the current subtask given perfect knowledge of the environment, and use the reward function of the corresponding skill. To ease training, we remove the collision penalty and do not terminate the episode due to collision. Besides, we use the region-goal navigation reward for the navigation subtask. Thanks to our improved reward functions and better training scheme, our monolithic RL baseline is much better than the original implementation in Szot et al. (2021). However, although able to move the object to its goal position, the policy never learns to release the object to complete the subtask Place during training. This might be due to exploration difficulty, since Place is the last subtask in a long sequence and all previous subtasks require the robot not to release. To boost its performance, we force the gripper to release anything held at the end of execution during evaluation.
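The subtask-inference reward composition described above can be sketched as follows (a minimal sketch; `Subtask` and the privileged `is_done` checks are our own stand-ins for the simulator's perfect state knowledge):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    name: str
    is_done: Callable  # privileged completion check on the environment state

def monolithic_reward(env_state, subtask_sequence, skill_rewards):
    """Apply the reward function of the first unfinished subtask in the plan."""
    for st in subtask_sequence:
        if not st.is_done(env_state):
            return skill_rewards[st.name](env_state)
    return 0.0  # every subtask done: the full task is complete
```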

E EVALUATION

E.1 SEQUENTIAL SKILL CHAINING

For evaluation, skills are sequentially executed in the order of their corresponding subtasks, as described in Sec 3.3. The main challenge is how to terminate a skill without privileged information. Each skill is terminated if its execution time exceeds its maximum episode length (200 steps for manipulation skills and 500 steps for the navigation skill). The termination condition of Pick is that an object is held and the end-effector is within 15cm of the resting position, which can be computed from proprioception only. The gripper is disabled to release for Pick. The termination condition of Place is that the gripper holds nothing and the end-effector is within 15cm of the resting position. The gripper is disabled to grasp for Place. Besides, anything held is released when Place terminates. For Open and Close, we use a heuristic from Szot et al. (2021): the skill terminates if the end-effector is within 15cm of the resting position and it has moved at least 30cm away from the resting position during execution. Navigate terminates when it calls the stop action. Furthermore, since the manipulation skills only learn to reset their end-effectors, we apply an additional operation to reset the whole arm after each skill. This reset operation is achieved by setting predefined joint positions as the target of the robot's PD controller.
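The privileged-information-free termination rules above can be summarized in a single dispatcher (a sketch; names are our own, and all inputs come from proprioception):

```python
def should_terminate(skill, holding, d_ee_rest, moved_away_30cm, steps, max_steps,
                     called_stop=False):
    """Sketch of per-skill termination without privileged information."""
    if steps >= max_steps:                 # timeout always terminates
        return True
    if skill == "pick":
        return holding and d_ee_rest <= 0.15
    if skill == "place":
        return (not holding) and d_ee_rest <= 0.15
    if skill in ("open", "close"):         # heuristic from Szot et al. (2021)
        return d_ee_rest <= 0.15 and moved_away_30cm
    if skill == "navigate":
        return called_stop
    return False
```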

E.2 PROGRESSIVE COMPLETION RATE

In this section, we describe how progressive completion rates are computed. The evaluation protocol is the same as in Szot et al. (2021) (see its Appendix F); here we phrase it in a way more friendly to readers with little knowledge of task planning and the Planning Domain Definition Language (PDDL). To partially evaluate a HAB task, we divide a full task into a sequence of stages (subgoals). For example, TidyHouse can be considered to consist of pick_0, place_0, pick_1, etc. Each stage can correspond to multiple subtasks. For example, the stage pick_i includes Navigate(s^i_0) and Pick(s^i_0). Thus, to be precise, the completion rate is computed based on stages instead of subtasks. We define a set of predicates to measure whether the goal of a stage is completed. A stage goal is completed if all the predicates associated with it are satisfied. The predicates are listed as follows:
• holding(target_obj|i): The robot is holding the i-th object.
• at(target_obj_pos|i,target_goal_pos|i): The i-th object is within 15cm of its goal position.
Figure 8: Progressive completion rates for HAB (Szot et al., 2021)
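The stage-based progressive evaluation above can be sketched as a prefix count over the ordered stage goals (a simplification of the per-step protocol; `predicates_hold` is our own stand-in for the PDDL predicate checks):

```python
def progressive_completion(stages, predicates_hold):
    """Count how many stage goals are completed in sequence. A stage is complete
    when all of its associated predicates hold; once a stage fails, later
    stages cannot be credited."""
    completed = 0
    for stage in stages:
        if predicates_hold(stage):
            completed += 1
        else:
            break
    return completed
```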

G MORE QUANTITATIVE METRICS

In this section, we present more quantitative metrics in addition to progressive completion rates for the main experiments on the 3 HAB tasks. We report the number of successfully placed objects and the average distance between objects and goals in Tables 1 and 2.

Listing 1: Stage goals and their associated predicates defined for TidyHouse (excerpt). The stages are listed in the order for progressive evaluation.

    ...
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
    pick_2:
      - "holding(target_obj|2)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
    place_2:
      - "not_holding()"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"
    pick_3:
      - "holding(target_obj|3)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"
    place_3:
      - "not_holding()"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"
      - "at(target_obj_pos|3,target_goal_pos|3)"
    pick_4:
      - "holding(target_obj|4)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"
      - "at(target_obj_pos|3,target_goal_pos|3)"
    place_4:
      - "not_holding()"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"
      - "at(target_obj_pos|3,target_goal_pos|3)"
      - "at(target_obj_pos|4,target_goal_pos|4)"

Listing 2: Stage goals and their associated predicates defined for PrepareGroceries (excerpt). The stages are listed in the order for progressive evaluation.

    ...
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
    pick_2:
      - "holding(target_obj|2)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
    place_2:
      - "not_holding()"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "at(target_obj_pos|1,target_goal_pos|1)"
      - "at(target_obj_pos|2,target_goal_pos|2)"

Listing 3: Stage goals and their associated predicates defined for SetTable (excerpt). The stages are listed in the order for progressive evaluation.

    ...
      - "closed_drawer(target_marker|0)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "opened_fridge(target_marker|1)"
    pick_1:
      - "closed_drawer(target_marker|0)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "opened_fridge(target_marker|1)"
      - "holding(target_obj|1)"
    place_1:
      - "closed_drawer(target_marker|0)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "not_holding()"
      - "at(target_obj_pos|1,target_goal_pos|1)"
    close_1:
      - "closed_drawer(target_marker|0)"
      - "at(target_obj_pos|0,target_goal_pos|0)"
      - "closed_fridge(target_marker|1)"
      - "at(target_obj_pos|1,target_goal_pos|1)"



Project website: https://sites.google.com/view/hab-m3
Codes: https://github.com/Jiayuan-Gu/hab-mobile-manipulation

Footnotes:
- To be precise, the tasks studied in this work are partially observable Markov decision processes (POMDPs).
- We only list the subtask sequence of TidyHouse for one object here for illustration. The containers are denoted with subscripts fr (fridge) and dr (drawer) if included in the skill.
- The geodesic distance to a region can be approximated by the minimum of all the geodesic distances to grid positions within the region.
- Each macro variation has a different, semantically plausible layout of large furniture (e.g., kitchen counter and fridge), while each micro variation is generated by perturbing small furniture (e.g., chairs and tables).
- 3 seeds for RL training multiplied by 3 seeds for initial states.
- https://github.com/facebookresearch/habitat-sim/pull/1694
- https://github.com/facebookresearch/habitat-lab/pull/764
- https://github.com/facebookresearch/habitat-lab/pull/837
- The positive x and y axes point forward and upward in Habitat.
- https://github.com/facebookresearch/habitat-lab/blob/main/habitat_baselines/rl/models/simple_cnn.py



Figure 2: Initial base positions of manipulation skills. We only show examples for Pick, Close drawer, and Close fridge, as Place, Open drawer, and Open fridge share the same initial base positions respectively. Positions are visualized as green points on the floor. The target object in Pick is highlighted by a cyan circle. Note that the initial base position of Pick (stationary) is the single navigable position closest to the object.

Open, Close. Finally, a feasible position is sampled from the candidates with rejection and noise. Compared to stationary manipulation, the rejection rate of our pipeline is much lower, and thus it can be efficiently employed on-the-fly during training. See Fig 2 for examples.

Figure 3: Progressive completion rates for HAB (Szot et al., 2021) tasks. The x-axis represents progressive subtasks. The y-axis represents the completion rate of each subtask. The mean and standard error for 100 episodes over 9 seeds are reported. Best viewed zoomed.

Figure 4: Qualitative comparison between stationary and mobile manipulation. In this example, the point-goal navigation skill terminates between two drawers (1st image). Mobile manipulation manages to open the correct drawer containing the bowl (last image in the bottom row), while stationary manipulation gets confused and finally opens the wrong drawer (last image in the top row). More qualitative results can be found in Appendix H and on our project website.

- Stationary: a navigable position randomly selected within a [0.80, -0.35] × [0.95, 0.35] region in front of the drawer.
- Mobile: a navigable position randomly selected within a [0.3, -0.6] × [1.5, 0.6] region in front of the drawer.
• Reward: I_open = (d^g_a ≤ 0.05) indicates whether the drawer is open. I_release indicates whether the handle is released when the drawer is open. I_grasp indicates whether the correct handle is grasped. a_base is the (2-dim) base action.
  r_t = 2Δ^h_ee I_!open + I_grasp + 2Δ^g_a I_holding + I_release + 2Δ^r_ee I_open + 2.5 I_succ - I_wrong - I_[d^h_ee > 0.2] I_holding - I_out - 0.004 ||a_base||_1
• Success: the drawer is open, and the end-effector is within 15cm of the resting position. I_succ = I_open ∧ I_!holding ∧ d^r_ee ≤ 0.15
• Failure:
  - I_wrong = 1: the wrong object or handle is picked.
  - I_[d^h_ee > 0.2] I_holding = 1: the grasped handle slides off the gripper.
  - I_out = 1: the robot moves out of a predefined region (a 2m × 3m region in front of the drawer).
  - I_[open(t-1) ∧ !open(t)] = 1: the drawer is no longer open after having been opened.
  - The gripper releases the handle when the drawer is not open (I_!open = 1).
  - Δ^g_a ≥ 0.1: the drawer is opened too fast.
• Observation space:

- Stationary: a navigable position randomly selected within a [0.3, -0.35] × [0.45, 0.35] region in front of the drawer.
- Mobile: a navigable position randomly selected within a [0.3, -0.6] × [1.0, 0.6] region in front of the drawer.
• Reward: almost the same as Open drawer, with open replaced by close. I_close = (d^g_a ≤ 0.1).
• Success: the drawer is closed, and the end-effector is within 15cm of the resting position.
• Failure: almost the same as Open drawer, with open replaced by close, except that the last constraint (Δ^g_a ≥ 0.1) is not included.

r_t = 2Δ^h_ee I_!open + I_grasp + 2Δ^g_a I_holding + I_release + Δ^r_ee I_open + 2.5 I_succ - I_[C_1:t > 5000] - I_wrong - I_[d^h_ee > 0.2] I_holding - I_out - 0.004 ||a_base||_1
• Success: the fridge is open, and the end-effector is within 15cm of the resting position. I_succ = I_open ∧ I_!holding ∧ d^r_ee ≤ 0.15
• Failure:
  - I_wrong = 1: the wrong object or handle is picked.
  - I_[d^h_ee > 0.2] I_holding = 1: the grasped handle slides off the gripper.
  - I_out = 1: the robot moves out of a predefined region (a 2m × 3.2m region in front of the fridge).
  - I_[open(t-1) ∧ !open(t)] = 1: the fridge is no longer open after having been opened.
  - The gripper releases the handle when the fridge is not open (I_!open = 1).
• Observation space:

• Initial base position (noise is applied in addition): a navigable position randomly selected within a [0.933, -1.5] × [1.833, 1.5] region in front of the fridge.
• Reward: almost the same as Open fridge, with open replaced by close. I_close = (d^g_a ≤ 0.15).
• Success: the fridge is closed, and the end-effector is within 15cm of the resting position.

However, the skill Pick needs to pick this object up when the drawer is open, and the actual position of the object then differs from the starting position. This is inconsistent with the other cases, where the object is in an open receptacle or the fridge. We observe that such ambiguity can hurt performance. See Fig 6 for all task-specific variants of skills.

Figure 9: Qualitative comparison in TidyHouse. In this example, the point-goal navigation skill terminates behind the TV (1st image). The arm is blocked by the TV in stationary manipulation (last image in the top row). The robot manages to move backward and avoid being blocked in mobile manipulation (last image in the middle row). The region-goal navigation skill instead terminates in front of the TV (1st image in the bottom row).

H MORE QUALITATIVE RESULTS

Fig 9, 10, 11 show more qualitative comparisons of different methods. Their animated versions can be found on our project website.

Figure 10: Qualitative comparison in PrepareGroceries. In this example, the point-goal navigation skill accidentally closes the fridge (top row). The region-goal navigation skill avoids disturbing the environment thanks to the collision penalty (bottom row).

Figure 11: Qualitative comparison in SetTable. In this example, with stationary manipulation, the navigation skill terminates at a position from which the robot cannot reach the target object in the fridge (top row). With mobile manipulation, the robot can move closer to the object and then pick it, compensating for the navigation skill (bottom row).

Move a bowl from a drawer to a table, and move a fruit from the fridge to the bowl on the table. Both the drawer and fridge are closed initially. The task requires interaction with articulated objects as well as picking objects from containers.

is the target orientation. Note that the 2D goal on the floor is different from the 3D goal specification for manipulation subtasks. I_[d^geo_t ≤ D] is an indicator of whether the agent is close enough to the 2D goal, where D is a threshold. I_[d^geo ≤ D ∧ d^ang ≤ Θ] is an indicator of navigation success, where D and Θ are thresholds for geodesic and angular distances. r_slack is a slack penalty. λ_ang and λ_succ are hyper-parameters.

Figure 8: Progressive completion rates for HAB tasks. The x-axis represents progressive subtasks. The y-axis represents the completion rate of each subtask. Results of the ablation experiments are presented with solid lines. The mean and standard error for 100 episodes over 9 seeds are reported.

Besides, we extend the S(L)+P experiment described in Sec 5.4, where we simply replace the initial states of stationary manipulation skills with those of mobile ones. We reject initial states whose target is not reachable due to the kinematic constraint, which is checked via inverse kinematics (IK). The extended experiment is denoted by S(L+IK)+P. Fig 8 shows the quantitative results. The overall success rate of S(L+IK)+P is 44.7%/21.1% in the cross-configuration/cross-layout setting. This indicates that increasing the feasible initial states helps stationary manipulation skills compared to S(L)+P (37.7%/18.1%), but there is still a large performance drop compared to S+P (57.4%/31.1%). One possible reason is that, although the target might be IK-reachable, it can be hard to reach with stationary manipulation skills due to collision with other objects, whereas mobile manipulation skills can first navigate to better locations with fewer obstacles in front.

These metrics, reported in Tables 1 and 2, are analogous to %FIXEDSTRICT and %E in Weihs et al. (2021).

Table 1: The number of successfully placed objects for HAB tasks. The metrics in the cross-configuration/cross-layout setting are reported. The number of objects to place is shown along with the name of each task.

Table 2: Average distance between objects and goals for HAB tasks. The metrics in the cross-configuration/cross-layout setting are reported. Note that the average distance is sensitive to outliers.

Navigate(s) (region-goal)

• Objective: navigate to the start of other skills, specified by s.
• Reward: refer to Eq 2. r_slack = 0.002, r_col = min(0.001 C_t, 0.2), λ_succ = 2.5.
• Success: the robot is within 0.1 meter of any goal in the region and 0.25 radian of the target orientation at the current position, and has called the stop action at the current time step.
• Observation space:
  - Depth images from the head camera.
  - The goal position s* in the base frame.
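Per the footnote on geodesic distances, the distance to a region goal can be approximated by the minimum over grid positions inside the region; a minimal sketch (using Euclidean distance as a stand-in for the simulator's geodesic distance; names are our own):

```python
import numpy as np

def region_goal_distance(agent_pos, region_goals):
    """Approximate the distance to a region by the minimum distance over
    sampled grid positions inside the region."""
    agent_pos = np.asarray(agent_pos)
    return min(np.linalg.norm(agent_pos - np.asarray(g)) for g in region_goals)
```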

C.1 PPO HYPER-PARAMETERS

Our PPO implementation is based on habitat-lab. The visual encoder is a simple CNN. The coefficients of the value and entropy losses are 0.5 and 0 respectively. We use 64 parallel environments and collect 128 transitions per environment to update the networks. We use 2 mini-batches, 2 epochs per update, and a clipping parameter of 0.2 for both the policy and the value function. The gradient norm is clipped at 0.5. We use the Adam optimizer with a learning rate of 0.0003, with linear learning rate decay enabled. The mean of the Gaussian action predicted by the policy network is activated by tanh. The (log) standard deviation of the Gaussian action, which is an input-independent parameter, is initialized to -1.0.
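For reference, the hyper-parameters above collected into a plain dict (key names are our own, not habitat-lab's config schema):

```python
# Hedged summary of the PPO hyper-parameters listed in C.1.
PPO_CONFIG = {
    "num_envs": 64,
    "rollout_length": 128,   # transitions per environment per update
    "num_mini_batches": 2,
    "epochs_per_update": 2,
    "clip_param": 0.2,       # for both policy and value
    "value_loss_coef": 0.5,
    "entropy_coef": 0.0,
    "max_grad_norm": 0.5,
    "lr": 3e-4,              # Adam, with linear decay
    "init_log_std": -1.0,    # input-independent log std of the Gaussian policy
}
```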

C.2 OTHER IMPLEMENTATION DETAILS

The PPO algorithm implemented by habitat-lab does not distinguish between the termination of the environment (MDP) and truncation due to the time limit. We fix this issue in our implementation. Furthermore, we train all the skills separately for each HAB task to avoid potential ambiguity. For example, the starting position of an object in a drawer is computed while the drawer is closed at the beginning of an episode.

During evaluation, we evaluate whether the current stage goal is completed at each time step. If the current stage goal is completed, we progress to the next stage. Hence, the completion rate monotonically decreases over stages. Listings 1, 2, 3 present the stages defined for each HAB task and the predicates associated with each stage. Note that the stage goal place_i only indicates that the object has been released at its goal position; the placement can still be unstable (e.g., the object falls off the table), which can lead to the failure of the next stage. Besides, due to abstract grasp, it is difficult to place the object stably since the pose of the grasped object cannot be fully controlled. Therefore, we modify the objective of SetTable to make the task achievable given abstract grasp. Concretely, instead of placing the fruit in the bowl, the robot only needs to place the fruit picked from the fridge at a goal position on the table.
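The termination/truncation fix mentioned above amounts to bootstrapping the value target through time-limit truncations; a minimal sketch (function and argument names are our own):

```python
def td_target(reward, next_value, gamma, terminated, truncated):
    """When an episode is cut off by the time limit (truncated), the next state
    still has value, so we bootstrap; only a true MDP termination zeroes the
    bootstrap term."""
    if terminated and not truncated:
        return reward                   # no future value after a real termination
    return reward + gamma * next_value  # bootstrap through time-limit truncation
```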

F MORE ABLATION STUDIES

In this section, we study the impact of different initial state distributions on mobile manipulation skills. We enlarge the initial states by changing the distributions of the initial base position (the radius around the target) and orientation. For reference, the maximum radius around the target is set to 2m in the main experiments (Sec 5). Several experiments are conducted: M(S)+R, M(L1)+R, M(L2)+R, and M(L3)+R. M(S)+R, M(L1)+R and M(L2)+R stand for experiments where the maximum radius around the target is set to 1.5m, 2.5m and 4m respectively. M(L3)+R keeps the radius at 2m, but samples the initial base orientation uniformly from [-π, π] instead of using the direction facing the target.
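The enlarged initial-state distributions can be sketched as follows (a sketch; navigability checks against the scene mesh are omitted, and names are our own):

```python
import numpy as np

def sample_initial_base(target_xy, max_radius, rng, face_target=True):
    """Sample an initial base position within max_radius of the target, facing
    the target by default, or with a uniform orientation in [-pi, pi]
    (the M(L3)+R variant)."""
    r = max_radius * np.sqrt(rng.uniform())       # uniform over the disk
    theta = rng.uniform(0.0, 2.0 * np.pi)
    pos = np.asarray(target_xy) + r * np.array([np.cos(theta), np.sin(theta)])
    if face_target:
        d = np.asarray(target_xy) - pos
        yaw = np.arctan2(d[1], d[0])
    else:
        yaw = rng.uniform(-np.pi, np.pi)
    return pos, yaw
```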

