GROUNDING PLANNABLE LIFTED ACTION MODELS FOR VISUALLY GROUNDED LOGICAL PREDICATES

Abstract

We propose FOSAE++, an unsupervised end-to-end neural system that generates a compact discrete state transition model (dynamics / action model) from raw visual observations. Our representation can be exported to the Planning Domain Description Language (PDDL), allowing state-of-the-art symbolic classical planners to perform high-level task planning on raw observations. FOSAE++ expresses states and actions in First-Order Logic (FOL), a superset of the so-called object-centric representations. It is the first unsupervised neural system that fully supports FOL in PDDL action modeling, whereas existing systems are limited to continuous, propositional, or property-based representations and/or require manually labeled input.

1. INTRODUCTION

Learning a high-level symbolic transition model of an environment from raw input (e.g., images) is a major challenge in the integration of connectionism and symbolism. Doing so without manually defined symbols is particularly difficult because it requires solving both the Symbol Grounding problem (Harnad, 1990; Taddeo & Floridi, 2005; Steels, 2008) and the Action Model Learning/Acquisition problem. Recently, seminal work by Asai & Fukunaga (2018, Latplan), which learns discrete planning models from images, has opened the door to applying symbolic Classical Planning systems to a wide variety of raw, noisy data. Latplan uses discrete variational autoencoders to generate propositional latent states and their dynamics (action model) directly from images. Unlike existing work, which requires several machine learning pipelines (SVMs/decision trees) and labeled inputs such as a sequence of high-level options (Konidaris et al., 2014), Latplan is an end-to-end unsupervised neural network that requires no manually labeled inputs.

Numerous extensions and enhancements have been proposed: Causal InfoGAN (Kurutach et al., 2018) instead uses the GAN framework to obtain propositional representations. Latplan's representation was shown to be compatible with symbolic Goal Recognition (Amado et al., 2018). First-Order State AutoEncoder (Asai, 2019, FOSAE) extends Latplan to generate predicate symbols. Cube-Space AutoEncoder (Asai & Muise, 2020, CSAE) regularizes the latent space into a particular form that directly exports to a learned propositional PDDL model (Fikes et al., 1972). Discrete Sequential Application of Words (DSAW) learns a plannable propositional word embedding from a natural language corpus (Asai & Tang, 2020).

In this paper, we obtain a lifted action model expressed in First-Order Logic (FOL), which is a superset of the object-centric (property-based) representations that the Machine Learning community has recently begun to pay attention to (e.g., the ICML workshop on Object-Oriented Learning, https://oolworkshop.github.io/), but which have long been a central focus of the broader AI community. In propositional action models, the environment representation is a fixed-size binary array and does not transfer to a different or dynamically changing environment with a varying number of objects. In contrast, lifted FOL representations generalize over objects and environments, as we demonstrate in Blocksworld with different numbers of blocks and in Sokoban with different map sizes.

We propose the Lifted First-Order Space AutoEncoder (FOSAE++), a neuro-symbolic architecture that learns a lifted PDDL action model by integrating and extending the FOSAE, CSAE, and Neural Logic Machine (Dong et al., 2019, NLM) architectures. The overall task of our system is illustrated in Fig. 1. The system takes a transition dataset containing a set of pairs of raw observations that are a single time step apart, where each observation consists of multiple visual segmentations of the objects, and learns a lifted action model of the environment, of the kind sketched below.
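For readers unfamiliar with lifted action models, the following hand-written schema from the standard Blocksworld domain illustrates what such a PDDL model looks like. It is only a point of reference, not the output of FOSAE++: the schemas learned by our system have the same structure, but use machine-generated (anonymous) predicate and action symbols grounded in visual features rather than these human-readable names.

    (define (domain blocksworld)
      (:requirements :strips)
      (:predicates (on ?x ?y) (ontable ?x) (clear ?x)
                   (holding ?x) (handempty))
      ;; A lifted action: parameters ?x and ?y are variables that
      ;; can be bound to any objects in the environment.
      (:action stack
        :parameters (?x ?y)
        :precondition (and (holding ?x) (clear ?y))
        :effect (and (not (holding ?x)) (not (clear ?y))
                     (clear ?x) (handempty) (on ?x ?y))))

Because the parameters are variables rather than concrete objects, a single schema such as stack applies unchanged to environments with any number of blocks; this is the generalization that propositional (fixed-size binary) representations lack.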




