GEOMETRY OF PROGRAM SYNTHESIS

Abstract

We present a new perspective on program synthesis in which programs may be identified with singularities of analytic functions. As an example, Turing machines are synthesised from input-output examples by propagating uncertainty through a smooth relaxation of a universal Turing machine. The posterior distribution over weights is approximated using Markov chain Monte Carlo, and bounds on the generalisation error of these models are estimated using the real log canonical threshold, a geometric invariant from singular learning theory.

1. INTRODUCTION

The idea of program synthesis dates back to the birth of modern computation itself (Turing, 1948) and is recognised as one of the most important open problems in computer science (Gulwani et al., 2017). However, there appear to be serious obstacles to synthesising programs by gradient descent at scale (Neelakantan et al., 2016; Kaiser & Sutskever, 2016; Bunel et al., 2016; Gaunt et al., 2016; Evans & Grefenstette, 2018; Chen et al., 2018), and these problems suggest that a fundamental study of the geometry of loss surfaces in program synthesis is warranted, since this geometry determines the learning process. To that end, in this paper we explain a new point of view on program synthesis using the singular learning theory of Watanabe (2009) and the smooth relaxation of Turing machines from Clift & Murfet (2018). In broad strokes this new geometric point of view on program synthesis says:

• Programs to be synthesised are singularities of analytic functions. If U ⊆ R^d is open and K : U → R is analytic, then x ∈ U is a critical point of K if ∇K(x) = 0, and a singularity of the function K if it is a critical point where K(x) = 0.

• The Kolmogorov complexity of a program is related to a geometric invariant of the associated singularity called the Real Log Canonical Threshold (RLCT). This invariant controls both the generalisation error and the learning process, and is therefore an appropriate measure of "complexity" in continuous program synthesis. See Section 3.

• The geometry has concrete practical implications. For example, an MCMC-based approach to program synthesis will find, with high probability, a solution that is of low complexity (if it finds a solution at all). We sketch a novel point of view on the problem of "bad local minima" (Gaunt et al., 2016) based on these ideas. See Section 4.

We demonstrate all of these principles in experiments with toy examples of synthesis problems.

Program synthesis as inference.
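The distinction drawn in the first bullet point, between critical points and singularities of K, can be checked by hand on a toy potential (our own illustrative example, not one of the paper's synthesis problems):

```python
# Toy illustration (not from the paper): distinguishing critical points
# from singularities for an analytic K : R^2 -> R.
# K(w1, w2) = (w1 * w2)^2 has gradient (2*w1*w2^2, 2*w1^2*w2).

def K(w1, w2):
    return (w1 * w2) ** 2

def grad_K(w1, w2):
    return (2 * w1 * w2 ** 2, 2 * w1 ** 2 * w2)

# Every point on the axes {w1 = 0} U {w2 = 0} is a critical point
# (the gradient vanishes) AND K = 0 there, so each is a singularity of K.
for w in [(0.0, 0.0), (0.0, 3.0), (-2.0, 0.0)]:
    assert grad_K(*w) == (0.0, 0.0)
    assert K(*w) == 0.0

# By contrast, K2(w) = w^2 + 1 has a critical point at w = 0 that is
# NOT a singularity in this sense, since K2(0) = 1 != 0.
def K2(w):
    return w ** 2 + 1

assert K2(0.0) != 0.0
```

Note that the singular set here is an entire pair of lines rather than an isolated point, which foreshadows the discussion of extended zero loci below.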
We use Turing machines, but mutatis mutandis everything applies to other programming languages. Let T be a Turing machine with tape alphabet Σ and set of states Q, and assume that on any input x ∈ Σ* the machine eventually halts with output T(x) ∈ Σ*. Then to the machine T we may associate the set {(x, T(x))}_{x ∈ Σ*} ⊆ Σ* × Σ*. Program synthesis is the study of the inverse problem: given a subset of Σ* × Σ*, we would like to determine (if possible) a Turing machine which computes the given outputs on the given inputs. If we are given a probability distribution q(x) on Σ*, then we can formulate this as a problem of statistical inference: given a probability distribution q(x, y) on Σ* × Σ*, determine the most likely machine producing the observed distribution q(x, y) = q(y|x)q(x). If we fix a universal Turing machine U, then Turing machines can be parametrised by codes w ∈ W_code with U(x, w) = T(x) for all x ∈ Σ*. We let p(y|x, w) denote the probability of U(x, w) = y (which is either zero or one), so that solutions to the synthesis problem are in bijection with the zeros of the Kullback-Leibler divergence between the true distribution and the model

K(w) = \int q(y|x) q(x) \log \frac{q(y|x)}{p(y|x, w)} \, dx \, dy .    (1)

So far this is just a trivial rephrasing of the combinatorial optimisation problem of finding a Turing machine T with T(x) = y for all (x, y) with q(x, y) > 0.

Smooth relaxation. One approach is to seek a smooth relaxation of the synthesis problem, consisting of an analytic manifold W ⊇ W_code and an extension of K to an analytic function K : W → R, so that we can search for the zeros of K using gradient descent. Perhaps the most natural way to construct such a smooth relaxation is to take W to be a space of probability distributions over W_code and prescribe a model p(y|x, w) for propagating uncertainty about codes to uncertainty about outputs (Gaunt et al., 2016; Evans & Grefenstette, 2018).
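A minimal discrete sketch of equation (1), with a code space and alphabet of our own choosing rather than the paper's UTM model: when q(x) has finite support and the target behaviour is deterministic, the integral reduces to a finite sum over inputs, and K(w) = 0 exactly at distributions concentrated on a code that computes the target.

```python
import math

# Toy sketch (our own example, not the paper's model): a "code space"
# of two programs on bits, relaxed to probability distributions over codes.
PROGRAMS = {"id": lambda x: x, "flip": lambda x: 1 - x}

q_x = {0: 0.5, 1: 0.5}   # input distribution q(x)
T = {0: 1, 1: 0}         # deterministic target behaviour: the bit flip

def p_y_given_x(w, x, y):
    """Pushforward model: p(y|x,w) = sum over codes of w(code)*[program_code(x) = y]."""
    return sum(prob for code, prob in w.items() if PROGRAMS[code](x) == y)

def K(w):
    """Equation (1) as a finite sum: sum_x q(x) * log(1 / p(T(x)|x,w)),
    since q(y|x) puts all its mass on y = T(x)."""
    return sum(qx * math.log(1.0 / p_y_given_x(w, x, T[x]))
               for x, qx in q_x.items())

assert K({"id": 0.0, "flip": 1.0}) == 0.0   # classical solution: K vanishes
assert K({"id": 0.25, "flip": 0.75}) > 0.0  # uncertain code: K > 0
```

The zeros of K are exactly the points of W placing all mass on codes that reproduce the target, mirroring the bijection described above.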
The particular model we choose is based on the semantics of linear logic (Clift & Murfet, 2018). Supposing that such a smooth relaxation has been chosen, together with a prior φ(w) over W, smooth program synthesis becomes the study of the statistical learning theory of the triple (p, q, φ).

There are perhaps two primary reasons to consider the smooth relaxation. Firstly, one might hope that stochastic gradient descent or techniques like Markov chain Monte Carlo will be effective means of solving the original combinatorial optimisation problem. This is not a new idea (Gulwani et al., 2017, §6) but so far its effectiveness for large programs has not been proven. Independently, one might hope to find powerful new mathematical ideas that apply to the relaxed problem and shed light on the nature of program synthesis. This is the purpose of the present paper.

Singular learning theory. We write W_0 = {w ∈ W | K(w) = 0}, so that W_0 ∩ W_code ⊆ W_0 ⊆ W, where W_0 ∩ W_code is the discrete set of solutions to the original synthesis problem. We refer to these as the classical solutions. As the vanishing locus of an analytic function, W_0 is an analytic space over R (Hironaka, 1964, §0.1; Griffiths & Harris, 1978), and it is interesting to study the geometry of this space near the classical solutions. Since K is a Kullback-Leibler divergence it is non-negative, so not only does K vanish on W_0 but ∇K vanishes there as well; hence every point of W_0 is a singular point. Beyond this the geometry of W_0 depends on the particular model p(y|x, w) that has been chosen, but some aspects are universal: the nature of program synthesis means that typically W_0 is an extended object (i.e. it contains points other than the classical solutions), and the Hessian matrix of second order partial derivatives of K at a classical solution is not invertible; that is, the classical solutions are degenerate critical points of K.
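The claim that W_0 is typically an extended object can already be seen in a tiny hand-made relaxation (a hypothetical illustration of ours, not the linear-logic model of the paper): if the code contains a bit that the machine never reads, then K is constant in the corresponding coordinate, so the zero locus contains a whole line of solutions rather than an isolated point.

```python
import math

# Toy relaxation with "dead code" (our own illustration): the code is a
# pair of bits (a, b); the machine applies the bit flip iff a = 1 and
# never reads b. Relax to w = (p, q) in [0,1]^2, the probabilities that
# a = 1 and b = 1 respectively.

def K(p, q):
    """KL divergence to the bit-flip target on inputs {0, 1}, each with
    mass 1/2. Only the flip probability p enters, since b is never read:
    K = (1/2)*log(1/p) + (1/2)*log(1/p) = -log(p)."""
    return -math.log(p)

# The zero locus W_0 contains the whole line {p = 1}: an extended object,
# because K does not depend on the dead-code coordinate q. This constancy
# in q is also why the Hessian of K is not invertible there.
for q in [0.0, 0.3, 0.7, 1.0]:
    assert K(1.0, q) == 0.0
assert K(0.5, 0.5) > 0.0
```

Unread or redundant parts of a code are pervasive in program synthesis, which is one informal reason the degeneracy described above should be regarded as typical rather than exceptional.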
This means that singularity theory is the appropriate branch of mathematics for studying the geometry of W_0 near a classical solution. It also means that the Fisher information matrix

I(w)_{ij} = \int \frac{\partial}{\partial w_i} \log p(y|x, w) \, \frac{\partial}{\partial w_j} \log p(y|x, w) \, q(y|x) q(x) \, dx \, dy

is degenerate at a classical solution, so that the appropriate branch of statistical learning theory is singular learning theory (Watanabe, 2007; 2009). For an introduction to singular learning theory in the context of deep learning see (Murfet et al., 2020).

Broadly speaking, the contribution of this paper is to realise program synthesis within the framework of singular learning theory, at both a theoretical and an experimental level. In more detail, the contents of the paper are:

• We define a staged pseudo-UTM (Appendix E) which is well-suited to experiments with the ideas discussed above. Propagating uncertainty about the code through this UTM using the ideas of Clift & Murfet (2018) defines a triple (p, q, φ) associated to a synthesis problem. This formally embeds program synthesis within singular learning theory.

• We realise this embedding in code by providing an implementation in PyTorch of this propagation of uncertainty through a UTM. Using the No-U-Turn variant of MCMC (Hoffman & Gelman, 2014) we can approximate the Bayesian posterior of any program synthesis problem (of course in practice we are limited by computational constraints in doing so).
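To make the MCMC point concrete, here is a minimal random-walk Metropolis sketch (a crude stand-in for the NUTS sampler used in the paper, run on a toy singular potential of our own choosing rather than a UTM) showing the posterior concentrating on an extended zero locus:

```python
import math
import random

# Minimal random-walk Metropolis sketch (NOT the NUTS sampler of the
# paper) targeting a tempered posterior proportional to exp(-n*K(w)),
# with the toy singular potential K(w) = (w1*w2)^2 and a flat prior
# on the box [-3, 3]^2.

random.seed(0)
n = 50  # sample size / inverse temperature

def K(w):
    return (w[0] * w[1]) ** 2

def log_post(w):
    if max(abs(w[0]), abs(w[1])) > 3.0:  # flat prior on [-3, 3]^2
        return -math.inf
    return -n * K(w)

w = (1.0, 1.0)
samples = []
for step in range(20000):
    # Gaussian proposal, standard Metropolis accept/reject.
    prop = (w[0] + random.gauss(0, 0.3), w[1] + random.gauss(0, 0.3))
    if math.log(random.random()) < log_post(prop) - log_post(w):
        w = prop
    samples.append(w)

# The chain concentrates near the zero locus W_0 = {w1*w2 = 0}, which is
# an extended set (the union of the two axes), not an isolated point.
mean_K = sum(K(s) for s in samples[5000:]) / len(samples[5000:])
assert mean_K < 0.1
```

Even this crude sampler illustrates the practical claim above: the posterior mass lies near all of W_0 at once, so samples land preferentially in the flattest (lowest-complexity) parts of the zero locus.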

