GEOMETRY OF PROGRAM SYNTHESIS

Abstract

We present a new perspective on program synthesis in which programs may be identified with singularities of analytic functions. As an example, Turing machines are synthesised from input-output examples by propagating uncertainty through a smooth relaxation of a universal Turing machine. The posterior distribution over weights is approximated using Markov chain Monte Carlo, and bounds on the generalisation error of these models are estimated using the real log canonical threshold, a geometric invariant from singular learning theory.

1. INTRODUCTION

The idea of program synthesis dates back to the birth of modern computation itself (Turing, 1948) and is recognised as one of the most important open problems in computer science (Gulwani et al., 2017). However, there appear to be serious obstacles to synthesising programs by gradient descent at scale (Neelakantan et al., 2016; Kaiser & Sutskever, 2016; Bunel et al., 2016; Gaunt et al., 2016; Evans & Grefenstette, 2018; Chen et al., 2018), and these problems suggest that a fundamental study of the geometry of loss surfaces in program synthesis is appropriate, since this geometry determines the learning process. To that end, in this paper we explain a new point of view on program synthesis using the singular learning theory of Watanabe (2009) and the smooth relaxation of Turing machines from Clift & Murfet (2018). In broad strokes this new geometric point of view on program synthesis says:

• Programs to be synthesised are singularities of analytic functions. If U ⊆ R^d is open and K : U → R is analytic, then x ∈ U is a critical point of K if ∇K(x) = 0, and a singularity of the function K if it is a critical point where K(x) = 0.

• The Kolmogorov complexity of a program is related to a geometric invariant of the associated singularity called the Real Log Canonical Threshold (RLCT). This invariant controls both the generalisation error and the learning process, and is therefore an appropriate measure of "complexity" in continuous program synthesis. See Section 3.

• The geometry has concrete practical implications. For example, an MCMC-based approach to program synthesis will find, with high probability, a solution that is of low complexity (if it finds a solution at all). We sketch a novel point of view on the problem of "bad local minima" (Gaunt et al., 2016) based on these ideas. See Section 4.

We demonstrate all of these principles in experiments with toy examples of synthesis problems.

Program synthesis as inference.
We use Turing machines, but mutatis mutandis everything applies to other programming languages. Let T be a Turing machine with tape alphabet Σ and set of states Q, and assume that on any input x ∈ Σ* the machine eventually halts with output T(x) ∈ Σ*. Then to the machine T we may associate the set {(x, T(x))}_{x ∈ Σ*} ⊆ Σ* × Σ*. Program synthesis is the study of the inverse problem: given a subset of Σ* × Σ*, we would like to determine (if possible) a Turing machine which computes the given outputs on the given inputs. If we are given a probability distribution q(x) on Σ*, then we can formulate this as a problem of statistical inference: given a probability distribution q(x, y) on Σ* × Σ*, determine the most likely machine producing the observed distribution q(x, y) = q(y|x)q(x). If we fix a universal Turing machine U, then Turing machines can be parametrised by codes w ∈ W_code with U(x, w) = T(x) for all x ∈ Σ*. We let p(y|x, w) denote the probability of U(x, w) = y (which is either zero or one), so that solutions to the synthesis problem are in bijection with the zeros of the Kullback-Leibler divergence between the true distribution and the model

K(w) = ∫ q(y|x) q(x) log [ q(y|x) / p(y|x, w) ] dx dy .   (1)

So far this is just a trivial rephrasing of the combinatorial optimisation problem of finding a Turing machine T with T(x) = y for all (x, y) with q(x, y) > 0.

Smooth relaxation. One approach is to seek a smooth relaxation of the synthesis problem, consisting of an analytic manifold W ⊇ W_code and an extension of K to an analytic function K : W → R, so that we can search for the zeros of K using gradient descent. Perhaps the most natural way to construct such a smooth relaxation is to take W to be a space of probability distributions over W_code and prescribe a model p(y|x, w) for propagating uncertainty about codes to uncertainty about outputs (Gaunt et al., 2016; Evans & Grefenstette, 2018).
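To make the inference viewpoint concrete, here is a minimal sketch (not the UTM model of this paper) in which K(w) of (1) can be evaluated by enumeration: inputs x ∈ {0, 1} are uniform, the true distribution is the deterministic function y = x, and a hypothetical two-parameter model puts probability w_x on the output y = 1. The zeros of K are then visible by hand.

```python
import math

def K(w0, w1, eps=1e-300):
    """Kullback-Leibler divergence of eq. (1) for a toy model:
    inputs x in {0, 1} are uniform, the true distribution is the
    deterministic function y = x, and the model puts probability
    w_x on the output y = 1.  Terms with q(y|x) = 0 contribute 0."""
    q = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}   # q(y|x)
    p = {(0, 0): 1 - w0, (0, 1): w0, (1, 0): 1 - w1, (1, 1): w1}
    return sum(0.5 * qy * math.log(qy / max(p[xy], eps))
               for xy, qy in q.items() if qy > 0)

print(K(0.0, 1.0))      # 0.0: the parameter encoding y = x is a zero of K
print(K(0.3, 0.9) > 0)  # True: other parameters have K > 0
```

In this toy setting the single classical solution is (w0, w1) = (0, 1), exactly the set where K vanishes; the UTM model of the paper replaces this two-parameter family with distributions over Turing machine codes.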
The particular model we choose is based on the semantics of linear logic (Clift & Murfet, 2018). Supposing that such a smooth relaxation has been chosen, together with a prior ϕ(w) over W, smooth program synthesis becomes the study of the statistical learning theory of the triple (p, q, ϕ). There are perhaps two primary reasons to consider the smooth relaxation. Firstly, one might hope that stochastic gradient descent or techniques like Markov chain Monte Carlo will be effective means of solving the original combinatorial optimisation problem. This is not a new idea (Gulwani et al., 2017, §6) but so far its effectiveness for large programs has not been proven. Independently, one might hope to find powerful new mathematical ideas that apply to the relaxed problem and shed light on the nature of program synthesis. This is the purpose of the present paper.

Singular learning theory. We denote by W_0 = {w ∈ W | K(w) = 0} the set of solutions, so that W_0 ∩ W_code ⊆ W_0 ⊆ W, where W_0 ∩ W_code is the discrete set of solutions to the original synthesis problem. We refer to these as the classical solutions. As the vanishing locus of an analytic function, W_0 is an analytic space over R (Hironaka, 1964, §0.1; Griffiths & Harris, 1978), and it is interesting to study the geometry of this space near the classical solutions. Since K is a Kullback-Leibler divergence it is non-negative, so on W_0 not only K but also ∇K vanishes; hence every point of W_0 is a singular point. Beyond this the geometry of W_0 depends on the particular model p(y|x, w) that has been chosen, but some aspects are universal: the nature of program synthesis means that typically W_0 is an extended object (i.e. it contains points other than the classical solutions) and the Hessian matrix of second order partial derivatives of K at a classical solution is not invertible; that is, the classical solutions are degenerate critical points of K.
This means that singularity theory is the appropriate branch of mathematics for studying the geometry of W_0 near a classical solution. It also means that the Fisher information matrix

I(w)_{ij} = ∫ [∂/∂w_i log p(y|x, w)] [∂/∂w_j log p(y|x, w)] q(y|x) q(x) dx dy

is degenerate at a classical solution, so that the appropriate branch of statistical learning theory is singular learning theory (Watanabe, 2007; 2009). For an introduction to singular learning theory in the context of deep learning see (Murfet et al., 2020). Broadly speaking, the contribution of this paper is to realise program synthesis within the framework of singular learning theory, at both a theoretical and an experimental level. In more detail, the contents of the paper are:

• We define a staged pseudo-UTM (Appendix E) which is well-suited to experiments with the ideas discussed above. Propagating uncertainty about the code through this UTM using the ideas of Clift & Murfet (2018) defines a triple (p, q, ϕ) associated to a synthesis problem. This formally embeds program synthesis within singular learning theory.

• We realise this embedding in code by providing a PyTorch implementation of this propagation of uncertainty through a UTM. Using the No-U-Turn Sampler (Hoffman & Gelman, 2014), a variant of Hamiltonian Monte Carlo, we can approximate the Bayesian posterior of any program synthesis problem (of course in practice we are limited by computational constraints in doing so).

• We explain how the real log canonical threshold (a geometric invariant) is related to Kolmogorov complexity (Section 3).

• We give a simple example (Appendix C) in which W_0 contains the set of classical solutions as a proper subset and every point of W_0 is a degenerate critical point of K.

• For two simple synthesis problems, detectA and parityCheck, we demonstrate all of the above, using MCMC to approximate the Bayesian posterior and theorems from Watanabe (2013) to estimate the RLCT (Section 5). We discuss how W_0 is an extended object and how the RLCT relates to the local dimension of W_0 near a classical solution.
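The degeneracy of the Fisher information matrix can be seen numerically even in a toy model. The sketch below uses a hypothetical two-parameter Bernoulli model p(y = 1 | w) = w1·w2 (not the UTM triple of the paper): the parametrisation is non-identifiable because only the product w1·w2 matters, so the Fisher matrix has rank one and zero determinant.

```python
import numpy as np

def fisher(w, h=1e-5):
    """Numerical Fisher information matrix I(w) for the toy model
    p(y = 1 | w) = w1 * w2.  This hypothetical parametrisation is
    non-identifiable (only the product matters), so I(w) is
    degenerate: both score vectors are proportional to (w2, w1)."""
    def logp(y, v):
        m = v[0] * v[1]
        return np.log(m if y == 1 else 1.0 - m)
    d = len(w)
    m = w[0] * w[1]
    I = np.zeros((d, d))
    for y, py in [(1, m), (0, 1.0 - m)]:
        g = np.zeros(d)                      # score: grad of log p at w
        for i in range(d):
            wp, wm = w.copy(), w.copy()
            wp[i] += h
            wm[i] -= h
            g[i] = (logp(y, wp) - logp(y, wm)) / (2 * h)
        I += py * np.outer(g, g)             # E[score score^T]
    return I

I = fisher(np.array([0.5, 0.8]))
print(np.linalg.det(I))  # ~0: the Fisher information matrix is degenerate
```

In the synthesis setting the same phenomenon occurs for structural reasons: distinct description-tape distributions can induce the same model p(y|x, w), as in Example 2.5.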

RELATED WORK

The idea of synthesising Turing machines can be traced back to the work of Solomonoff on inductive inference (Solomonoff, 1964). A more explicit form of the problem was given by Biermann (1972), who proposed an algorithmic method. Machine learning based approaches appear in Schmidhuber (1997) and Hutter (2004), which pay particular attention to model complexity, and in Gaunt et al. (2016) and Freer et al. (2014), the latter using the notion of a "universal probabilistic Turing machine" (De Leeuw et al., 1956). A different probabilistic extension of a universal Turing machine was introduced in Clift & Murfet (2018) via linear logic. Studies of the singular geometry of learning models go back to Amari et al. (2003) and, notably, the extensive work of Watanabe (2007; 2009).

2. TURING MACHINE SYNTHESIS AS SINGULAR LEARNING

All known approaches to program synthesis can be formulated in terms of a singular learning problem. Singular learning theory is the extension of statistical learning theory to account for the fact that the set of learned parameters W_0 has the structure of an analytic space, as opposed to an analytic manifold (Watanabe, 2007; 2009). It is organised around triples (p, q, ϕ) consisting of a class of models {p(y|x, w) : w ∈ W}, a true distribution q(y|x) and a prior ϕ on W.

In our approach we fix a Universal Turing Machine (UTM), denoted U, with a description tape (which specifies the code of the Turing machine to be executed), a work tape (simulating the tape of that Turing machine during its operation) and a state tape (simulating the state of that Turing machine). The general statistical learning problem that can be formulated using U is the following: given some initial string x on the work tape, predict the state of the simulated machine and the contents of the work tape after some specified number of steps (Clift & Murfet, 2018, §7.1). For simplicity, in this paper we consider models that only predict the final state; the necessary modifications in the general case are routine. We also assume that W parametrises Turing machines whose tape alphabet Σ and set of states Q have been encoded by individual symbols in the tape alphabet of U. Hence U is actually what we call a pseudo-UTM (see Appendix E). Again, treating the general case is routine and for the present purposes only introduces uninteresting complexity.

Let Σ denote the tape alphabet of the simulated machine, Q the set of states, and let L, S, R stand for left, stay and right, the possible motions of the Turing machine head. We assume that |Q| > 1 since otherwise the synthesis problem is trivial.
The set of ordinary codes W_code for a Turing machine sits inside a compact space W of probability distributions over codes

W_code := ∏_{σ,q} (Σ × Q × {L, S, R}) ⊆ ∏_{σ,q} (∆Σ × ∆Q × ∆{L, S, R}) =: W   (3)

where ∆X denotes the set of probability distributions over a set X, see (8), and the product is over pairs (σ, q) ∈ Σ × Q. For example, the point {(σ′, q′, d)}_{σ,q} ∈ W_code encodes the machine which, when it reads σ under the head in state q, writes σ′, transitions into state q′ and moves in direction d. Given w ∈ W_code let step^t(x, w) ∈ Q denote the contents of the state tape of U after t timesteps (of the simulated machine) when the work tape is initialised with x and the description tape with w. There is a principled extension of this operation of U to a smooth function

∆step^t : Σ* × W → ∆Q   (4)

which propagates uncertainty about the symbols on the description tape to uncertainty about the final state, and we refer to this extension as the smooth relaxation of U. The details are given in Appendix F, but at an informal level the idea behind the relaxation is easy to understand: to sample from ∆step^t(x, w), we run U to simulate t timesteps in such a way that whenever the UTM needs to "look at" an entry on the description tape, we sample from the corresponding distribution specified by w. The significance of the particular smooth relaxation that we use is that its derivatives have a logical interpretation (Clift & Murfet, 2018, §7.1). The class of models that we consider is

p(y|x, w) = ∆step^t(x, w)(y)   (5)

where t is fixed for simplicity in this paper. More generally we could also view x as consisting of a sequence and a timeout, as is done in (Clift & Murfet, 2018, §7.1). The construction of this model is summarised in Figure 1.

Definition 2.1 (Synthesis problem). A synthesis problem for U consists of a probability distribution q(x, y) over Σ* × Q.
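The informal description of the relaxation can be sketched in a few lines of Python: maintain a distribution over machine configurations and, at each step, resample the relevant code entry from w. This is a toy stand-in for the staged pseudo-UTM of Appendix E; the finite clamped tape, the state names and the ±1 move encoding are simplifications introduced here, not the paper's construction.

```python
from collections import defaultdict

def smooth_step(config_dist, w):
    """One smoothed step of a simulated Turing machine.

    config_dist maps configurations (tape, head, state) to probabilities;
    w maps (symbol, state) to a distribution over (write, new_state, move).
    Each code lookup samples w afresh, matching the informal description
    of the relaxation.  The tape is finite with a clamped head, a
    simplification made for this sketch."""
    out = defaultdict(float)
    for (tape, head, state), p in config_dist.items():
        for (write, new_state, move), pw in w[(tape[head], state)].items():
            new_tape = tape[:head] + (write,) + tape[head + 1:]
            new_head = min(max(head + move, 0), len(tape) - 1)
            out[(new_tape, new_head, new_state)] += p * pw
    return dict(out)

# A deterministic code detecting an 'A': once seen, stay in state 'acc'.
# (A hypothetical two-state machine, not the paper's exact encoding.)
w = {('A', 'rej'): {('A', 'acc', +1): 1.0},
     ('B', 'rej'): {('B', 'rej', +1): 1.0},
     ('A', 'acc'): {('A', 'acc', +1): 1.0},
     ('B', 'acc'): {('B', 'acc', +1): 1.0}}
dist = {(('B', 'A', 'B'), 0, 'rej'): 1.0}
for _ in range(3):
    dist = smooth_step(dist, w)
final_state = defaultdict(float)
for (tape, head, state), p in dist.items():
    final_state[state] += p
print(dict(final_state))  # {'acc': 1.0}
```

Replacing any row of w by a non-degenerate distribution makes the final-state probabilities polynomial in the entries of w, which is the sense in which ∆step^t is a smooth (indeed polynomial) function of the code.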
We say that the synthesis problem is deterministic if there is f : Σ* → Q such that q(y = f(x)|x) = 1 for all x ∈ Σ*.

[Figure 1: construction of the model. The description tape is initialised with the code w and the work tape with the input x; iterating ∆step through the staged pseudo-UTM propagates uncertainty about the code to a distribution over the final state y.]

Definition 2.2. The triple (p, q, ϕ) associated to a synthesis problem is the model p of (5) together with the true distribution q and uniform prior ϕ on the parameter space W. The Kullback-Leibler function K(w) of the synthesis problem is defined by (1), and a solution to the synthesis problem is a point of W_0. A classical solution is a point of W_0 ∩ W_code.

As ∆step^t is a polynomial function, K is analytic and so W_0 is a semi-analytic space (it is cut out of the semi-analytic space W by the vanishing of K). If the synthesis problem is deterministic and q(x) is uniform on some finite subset of Σ*, then W_0 is semi-algebraic (it is cut out of W by polynomial equations) and all solutions lie at the boundary of the parameter space W (Appendix D). However, in general W_0 is only semi-analytic and intersects the interior of W (Example C.2). We assume that q(y|x) is realisable, that is, there exists w_0 ∈ W with q(y|x) = p(y|x, w_0).

A triple (p, q, ϕ) is regular if the model is identifiable, i.e. for all inputs x the map sending w to the conditional probability distribution p(y|x, w) is one-to-one, and the Fisher information matrix is non-degenerate. Otherwise, the learning machine is strictly singular (Watanabe, 2009, §1.2.1). Triples arising from synthesis problems are typically singular: in Example 2.5 below we show an explicit example where multiple parameters w determine the same model, and in Example C.2 we give an example where the Hessian of K is degenerate everywhere on W_0 (Watanabe, 2009, §1.1.3).

Remark 2.3.
Non-deterministic synthesis problems arise naturally in various contexts, for example in the fitting of algorithms to the behaviour of deep reinforcement learning agents. Suppose an agent is acting in an environment with starting states encoded by x ∈ Σ* and possible episode end states by y ∈ Q. Even if the optimal policy is known to determine a computable function Σ* → Q, the statistics of the observed behaviour after finite training time will only provide a function Σ* → ∆Q, and if we wish to fit algorithms to behaviour it makes sense to deal with this uncertainty directly.

Definition 2.4. Let (p, q, ϕ) be the triple associated to a synthesis problem. The Real Log Canonical Threshold (RLCT) λ of the synthesis problem is defined so that −λ is the largest pole of the meromorphic extension (Atiyah, 1970) of the zeta function

ζ(z) = ∫ K(w)^z ϕ(w) dw .

The more singular the analytic space W_0 of solutions is, the smaller the RLCT. One way to think of the RLCT is as a count of the effective number of parameters near W_0 (Murfet et al., 2020, §4). In Section 3 we relate the RLCT to Kolmogorov complexity, and in Section 5 we estimate the RLCT of the synthesis problem detectA given below, using the method explained in Appendix A.

Example 2.5 (detectA). The deterministic synthesis problem detectA has Σ = {□, A, B}, Q = {reject, accept}, and q(y|x) is determined by the function taking in a string x of A's and B's and returning the state accept if the string contains an A and the state reject otherwise. The conditional true distribution q(y|x) is realisable because this function is computed by a Turing machine. Two solutions are shown in Figure 2. On the left is a parameter w_l ∈ W_0 \ W_code and on the right is w_r ∈ W_0 ∩ W_code. Varying the distributions in w_l that have nonzero entropy, we obtain a submanifold V ⊆ W_0 containing w_l of dimension 14.
This leads by (Watanabe, 2009, Remark 7.3) to a bound on the RLCT of λ ≤ (1/2)(30 − 14) = 8, which is consistent with the experimental results in Table 1. This highlights that solutions need not lie at vertices of the probability simplex, and W_0 may contain a high-dimensional submanifold around a given classical solution.

2.1. THE SYNTHESIS PROCESS

Synthesis is a problem because we do not assume that the true distribution is known: for example, if q(y|x) is deterministic and the associated function is f : Σ* → Q, we assume that some example pairs (x, f(x)) are known but that no general algorithm for computing f is known (if it were, synthesis would already have been performed). In practice synthesis starts with a sample D_n = {(x_i, y_i)}_{i=1}^n from q(x, y) with associated empirical Kullback-Leibler distance

K_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ].

If the synthesis problem is deterministic and u ∈ W_code then K_n(u) = 0 if and only if u explains the data, in the sense that step^t(x_i, u) = y_i for 1 ≤ i ≤ n. We now review two natural ways of finding such solutions in the context of machine learning. Synthesis by stochastic gradient descent (SGD). The first approach is to view the process of program synthesis as stochastic gradient descent for the function K : W → R. We view D_n as a large training set, further sample subsets D_m with m ≪ n, and compute ∇K_m to take gradient descent steps w_{i+1} = w_i − η∇K_m(w_i) for some learning rate η. Stochastic gradient descent has the advantage (in principle) of scaling to high-dimensional parameter spaces W, but in practice it is challenging to use gradient descent to find points of W_0 (Gaunt et al., 2016). Synthesis by sampling. The second approach is to consider the Bayesian posterior associated to the synthesis problem, which can be viewed as an update on the prior distribution ϕ after seeing D_n:

p(w|D_n) = p(D_n|w)p(w) / p(D_n) = (1/Z_n) ϕ(w) Π_{i=1}^{n} p(y_i|x_i, w) = (1/Z_n^0) exp{ −nK_n(w) + log ϕ(w) }

where Z_n^0 = ∫ ϕ(w) exp(−nK_n(w)) dw. If n is large the posterior distribution concentrates around solutions w ∈ W_0, and so sampling from the posterior will tend to produce machines that are (nearly) solutions. The gold standard for sampling is Markov chain Monte Carlo (MCMC).
Scaling MCMC to settings where W is high-dimensional is challenging, and there have been many attempts to bridge the gap with SGD (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Zhang et al., 2020). Nonetheless, in simple cases we demonstrate experimentally in Section 5 that machines may be synthesised by using MCMC to sample from the posterior.
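The "synthesis by sampling" loop can be sketched in miniature (our own toy, not the paper's PyTorch/NUTS setup): the model here is a single Bernoulli parameter w ∈ [0, 1], the true distribution has q(y = 1) = 0.8, and approximate posterior samples are drawn with random-walk Metropolis-Hastings. At large n the samples concentrate near the solution set W_0 = {0.8}.

```python
import random
import math

def log_posterior(w, data):
    """Unnormalised log posterior under a uniform prior on [0,1]."""
    if not 0.0 < w < 1.0:
        return -math.inf
    k = sum(data)
    return k * math.log(w) + (len(data) - k) * math.log(1.0 - w)

rng = random.Random(1)
data = [1 if rng.random() < 0.8 else 0 for _ in range(500)]  # sample from q

w, lp = 0.5, log_posterior(0.5, data)
samples = []
for step in range(20_000):
    w_new = w + rng.gauss(0.0, 0.05)          # symmetric random-walk proposal
    lp_new = log_posterior(w_new, data)
    if math.log(rng.random()) < lp_new - lp:  # Metropolis accept/reject
        w, lp = w_new, lp_new
    if step >= 5_000:                          # discard burn-in
        samples.append(w)

posterior_mean = sum(samples) / len(samples)
print(posterior_mean)  # concentrates near the solution 0.8
```

The same logic, with NUTS in place of the random walk and the smooth relaxation's likelihood in place of the Bernoulli one, is what the experiments of Section 5 run at scale.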

3. COMPLEXITY OF PROGRAMS

Every Turing machine is the solution of a deterministic synthesis problem, so Section 2 associates to any Turing machine a singularity of a semi-analytic space W_0. To indicate that this connection is not vacuous, we sketch how the complexity of a program is related to the real log canonical threshold of a singularity; a more detailed discussion will appear elsewhere. Let q(x, y) be a deterministic synthesis problem for U which only involves input sequences in some restricted alphabet Σ_input, that is, q(x) = 0 if x ∉ (Σ_input)*. Let D_n be sampled from q(x, y) and let u, v ∈ W_code ∩ W_0 be two explanations for the sample, in the sense that K_n(u) = K_n(v) = 0. Which explanation for the data should we prefer? The classical answer, based on Occam's razor (Solomonoff, 1964), is that we should prefer the shorter program, that is, the one using the fewest states and symbols. Set N = |Σ| and M = |Q|. Any Turing machine T using N′ ≤ N symbols and M′ ≤ M states has a code for U of length cM′N′, where c is a constant. We assume that Σ_input is included in the tape alphabet of T, so that N′ ≥ |Σ_input|, and define the Kolmogorov complexity of q with respect to U to be the infimum c(q) of M′N′ over Turing machines T that give classical solutions for q. Let λ be the RLCT of the triple (p, q, ϕ) associated to the synthesis problem (Definition 2.4). Theorem 3.1. λ ≤ (1/2)(M + N)c(q). Proof. Let u ∈ W_code ∩ W_0 be the code of a Turing machine realising the infimum in the definition of the Kolmogorov complexity, and suppose that this machine only uses symbols in Σ′ and states in Q′ with N′ = |Σ′| and M′ = |Q′|. The time evolution of the staged pseudo-UTM U simulating u on x ∈ (Σ_input)* is independent of the entries on the description tape that belong to tuples of the form (σ, q, ?, ?, ?) with (σ, q) ∉ Σ′ × Q′. Let V ⊆ W be the submanifold of points which agree with u on all tuples with (σ, q) ∈ Σ′ × Q′ and are otherwise free.
Then u ∈ V ⊆ W_0 and codim(V) = M′N′(M + N), so by (Watanabe, 2009, Theorem 7.3) we have λ ≤ (1/2)codim(V) = (1/2)(M + N)c(q). Remark 3.2. The Kolmogorov complexity depends only on the number of symbols and states used. The RLCT is a more refined invariant, since it also depends on how each symbol and state is used (Clift & Murfet, 2018, Remark 7.8), as this affects the polynomials defining W_0 (see Appendix D).

4. PRACTICAL IMPLICATIONS

Using singular learning theory we have explained how programs to be synthesised are singularities of analytic functions, and how the Kolmogorov complexity of a program bounds the RLCT of the associated singularity. We now sketch some practical insights that follow from this point of view. Synthesis minimises the free energy: the sampling-based approach to synthesis (Section 2.1) aims to approximate, via MCMC, sampling from the Bayesian posterior for the triple (p, q, ϕ) associated to a synthesis problem. To understand the behaviour of these Markov chains we follow the asymptotic analysis of (Watanabe, 2009, Section 7.6). If we cover W by small closed balls V_α around points w_α, then the probability that a sample comes from V_α is

p_α = (1/Z_n^0) ∫_{V_α} e^{−nK_n(w)} ϕ(w) dw,

and if n is sufficiently large this is proportional to e^{−f_α}, where the quantity f_α = nK_α + λ_α log(n) is called the free energy. Here K_α is the smallest value of the Kullback-Leibler divergence K on V_α and λ_α is the RLCT of the set W_{K_α} ∩ V_α, where W_c = {w ∈ W | K(w) = c} is a level set of K. The Markov chains used to generate approximate samples from the posterior are attempting to minimise the free energy, which involves a tradeoff between the energy nK_α and the entropy λ_α log(n). Why synthesis gets stuck: the kind of local minima of the free energy that we want the synthesis process to find are solutions w_α ∈ W_0 where λ_α is minimal. By Section 3 one may think of these points as the "lowest complexity" solutions. However, it is possible that there are other local minima of the free energy. Indeed, there may be local minima where the free energy is lower than the free energy at any solution, since at finite n it is possible to trade off an increase in K_α against a decrease in the RLCT λ_α. In practice, the existence of such "siren minima" of the free energy may manifest itself as regions where the synthesis process gets stuck and fails to converge to a solution.
In such a region nK_α + λ_α log(n) < λ log(n), where λ is the RLCT of the synthesis problem. In practice it has been observed that program synthesis by gradient descent often fails for complex problems, in the sense that it fails to converge to a solution (Gaunt et al., 2016). While synthesis by SGD and synthesis by sampling are different processes, it is a reasonable hypothesis that these siren minima are a significant contributing factor in both cases. Can we avoid siren minima? If we let λ_c denote the RLCT of the level set W_c, then siren minima of the free energy are impossible at given values of n and c as long as λ_c ≥ λ − cn/log(n). Recall that the more singular W_c is, the lower the RLCT, so this lower bound says that the level sets should not become too singular too quickly as c increases. At any given value of n there is a "siren free" region in the range c ≥ λ log(n)/n, since the RLCT is non-negative (Figure 3). Thus the learning process will be more reliable the smaller λ log(n)/n is. This can be arranged either by increasing n (providing more examples) or decreasing λ. While the RLCT is determined by the synthesis problem, it is possible to change its value by changing the structure of the UTM U. As we have defined it, U is a "simulation type" UTM, but one could for example add special states such that if a code specifies a transition into such a state, a series of steps is executed by the UTM (i.e. a subroutine). This amounts to specifying codes in a higher-level programming language. Hence one of the practical insights that can be derived from the geometric point of view on program synthesis is that varying this language is a natural way to engineer the singularities of the level sets of K, which according to singular learning theory has direct implications for the learning process.
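The energy/entropy tradeoff is easy to see numerically. In the sketch below (the K and RLCT values are made up for illustration, not measured), a "siren" region with small positive K but low RLCT has lower free energy than the true solution at small n, while past the cutoff c ≥ λ log(n)/n the solution wins.

```python
import math

def free_energy(n, K, rlct):
    """Asymptotic free energy f = n*K + rlct*log(n) of a region."""
    return n * K + rlct * math.log(n)

solution = dict(K=0.0, rlct=8.0)   # a true solution, e.g. the detectA bound
siren = dict(K=0.01, rlct=2.0)     # hypothetical near-solution of low RLCT

for n in [1_000, 10_000, 100_000]:
    cutoff = solution["rlct"] * math.log(n) / n  # siren-free for K >= cutoff
    print(n,
          round(free_energy(n, **solution), 1),
          round(free_energy(n, **siren), 1),
          cutoff)
```

At n = 1000 the siren's K = 0.01 lies below the cutoff λ log(n)/n ≈ 0.055 and its free energy is lower; by n = 10000 the cutoff has dropped below 0.01 and the solution's free energy is lower, matching the discussion above.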

5. EXPERIMENTS

We estimate the RLCT for the triples (p, q, ϕ) associated to the synthesis problems detectA (Example 2.5) and parityCheck. Hyperparameters of the various machines are contained in Table 3 of Appendix B. The true distribution q(x) is defined as follows: we fix a minimum and maximum sequence length a ≤ b, and to sample x ∼ q(x) we first sample a length l uniformly from [a, b] and then uniformly sample x from {A, B}^l. We perform MCMC on the weight vector for the model class {p(y|x, w) : w ∈ W}, where w is represented in our PyTorch implementation by three tensors of shapes {[L, n_i]}_{1≤i≤3}, where L is the number of tuples on the description tape of the TM being simulated and the n_i are the numbers of symbols, states and directions respectively. A direct simulation of the UTM is used for all experiments to improve computational efficiency (Appendix G). We generate, for each inverse temperature β and dataset D_n, a Markov chain via the No-U-Turn sampler of Hoffman & Gelman (2014). We use the standard uniform distribution as our prior ϕ. For the problem detectA given in Example 2.5 the dimension of the parameter space is dim W = 30. We use generalised least squares to fit the RLCT λ (with goodness-of-fit measured by R²); the algorithm is given in Appendix A. Our results are displayed in Table 1 and Figure 4. The distribution q(x) for parityCheck is as discussed above, and q(y|x) is determined by the function taking in a string of A's and B's and terminating in state accept if the string contains the same number of A's as B's, and in state reject otherwise. The string is assumed to contain no blank symbols. The true distribution is realisable because there is a Turing machine using Σ and Q which computes this function: the machine works by repeatedly overwriting pairs consisting of a single A and B with X's; if there are any A's without a matching B left over (or vice versa), we reject; otherwise we accept.
In more detail, the starting state getNextAB moves right on the tape until the first A or B is found, and overwrites it with an X. 
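The data-generating process for these experiments can be sketched as follows (a standalone version of ours; the function names are not from the paper's code). A length l is drawn uniformly from [a, b], then x is drawn uniformly from {A, B}^l, and q(y|x) labels x by the function the target machine computes.

```python
import random

def sample_x(a, b, rng):
    """Sample x ~ q(x): length l uniform on [a, b], then x uniform on {A,B}^l."""
    l = rng.randint(a, b)
    return "".join(rng.choice("AB") for _ in range(l))

def detect_a(x):
    """q(y|x) for detectA: accept iff the string contains an A."""
    return "accept" if "A" in x else "reject"

def parity_check(x):
    """q(y|x) for parityCheck: accept iff #A's equals #B's."""
    return "accept" if x.count("A") == x.count("B") else "reject"

rng = random.Random(0)
dataset = [(x, detect_a(x)) for x in (sample_x(2, 5, rng) for _ in range(5))]
print(dataset)
```

A sample D_n of such pairs is what the empirical Kullback-Leibler distance K_n of Section 2.1 is computed from.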

6. DISCUSSION

We have developed a theoretical framework in which all programs can in principle be learnt from input-output examples via an existing optimisation procedure. This is done by associating to each program a smooth relaxation which, based on Clift & Murfet (2018), can be argued to be more canonical than existing approaches. This realisation has important implications for the building of intelligent systems. In approaches to program synthesis based on gradient descent there is a tendency to think of solutions to the synthesis problem as isolated critical points of the loss function K, but this is a false intuition based on regular models. Since neural networks, Bayesian networks, smooth relaxations of UTMs and all other extant approaches to smooth program synthesis are strictly singular models (the map from parameters to functions is not injective), the set W_0 of parameters w with K(w) = 0 is a complex extended object whose geometry is shown by Watanabe's singular learning theory to be deeply related to the learning process. We have examined this geometry in several specific examples and shown how to think about the complexity of programs from a geometric perspective. It is our hope that algebraic geometry can assist in developing the next generation of synthesis machines. Each RLCT estimate λ(D_n) in Algorithm 1 was performed by linear regression on the pairs {(1/β_i, E^{β_i}_w[nL_n(w)])}_{i=1}^{5}, where the five inverse temperatures β_i are centred on the inverse temperature 1/T, with T the temperature reported for each experiment in Table 1 and Table 2. From a Bayesian perspective, predictions about outputs y should be made using the predictive distribution

p*(y|x, D_n) = ∫ p(y|x, w) p(w|D_n) dw.

The Bayesian generalisation error associated to the Bayesian predictor is defined as the Kullback-Leibler distance to the true conditional distribution

B_g(n) := D_KL(q ‖ p*) = ∫∫ q(y|x) q(x) log [ q(y|x) / p*(y|x) ] dy dx.
If some fundamental conditions are satisfied (Definition 6.1 and Definition 6.3 of Watanabe (2009)), then by Theorem 6.8 of loc. cit. there exists a random variable B*_g such that as n → ∞, E[nB_g(n)] converges to E[B*_g]. In particular, by Theorem 6.10 of Watanabe (2009), E[B*_g] = λ.

B HYPERPARAMETERS

The hyperparameters for the various synthesis tasks are contained in Table 3. The number of samples is R in Algorithm 1 and the number of datasets is |T|. Samples are taken according to the Dirichlet distribution, a probability distribution over the simplex, which is controlled by the concentration parameter. When the concentration is constant across all dimensions, as is assumed here, this corresponds to a density which is symmetric about the uniform probability mass function at the centre of the simplex. The value α = 1.0 corresponds to the uniform distribution over the simplex. Finally, the chain temperature controls the default β value, i.e. all inverse temperature values are centred around 1/T where T is the chain temperature.
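The Dirichlet sampling described above can be sketched with the standard Gamma-normalisation construction (a generic recipe, not the paper's implementation): normalising k independent Gamma(α, 1) draws gives a symmetric Dirichlet sample on the (k−1)-simplex, with α = 1.0 recovering the uniform distribution.

```python
import random

def dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha, ..., alpha) on the (k-1)-simplex
    by normalising independent Gamma(alpha, 1) variables."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
w = dirichlet(1.0, 3, rng)  # a point of the 2-simplex, uniformly distributed
print(w, sum(w))
```

Larger concentrations (α > 1) pull mass toward the centre of the simplex, i.e. toward high-entropy distributions over symbols, states and directions.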

C THE SHIFT MACHINE

The pseudo-UTM U is a complicated Turing machine, and the models p(y|x, w) of Section 2 are therefore not easy to analyse by hand. To illustrate the kind of geometry that appears, we study the simple Turing machine shiftMachine of Clift & Murfet (2018) and formulate an associated statistical learning problem. The tape alphabet is Σ = { , A, B, 0, 1, 2} and the input to the machine is a string of the form n a_1 a_2 a_3 where n is called the counter and a_i ∈ {A, B}. The transition function, given in loc. cit., moves the string of A's and B's leftwards by n steps and fills the right-hand end of the string with A's, keeping the string length invariant. For example, if 2BAB is the input to M, the output will be 0BAA. Set W = ∆{0, 2} × ∆{A, B} and view w = (h, k) ∈ W as representing a probability distribution (1 − h)·0 + h·2 for the counter and (1 − k)·B + k·A for a_1. The model is

p(y | x = (a_2, a_3), w) = (1 − h)^2 k · A + (1 − h)^2 (1 − k) · B + Σ_{i=2}^{3} \binom{2}{i−1} h^{i−1} (1 − h)^{3−i} · a_i.

Under review as a conference paper at ICLR 2021

The space W of distributions is therefore contained in the affine space with coordinate ring

R_W = R[ x^{σ,q}_{σ′}, y^{σ,q}_{q′}, z^{σ,q}_{d} ] for σ, σ′ ∈ Σ, q, q′ ∈ Q, d ∈ {L, S, R}.

The function F^x = ∆step^t(x, −) : W → ∆Q is polynomial (Clift & Murfet, 2018, Proposition 4.2), and for s ∈ Q we denote by F^x_s ∈ R_W the polynomial computing the associated component of the function F^x. Let ∂W denote the boundary of the manifold with corners W, that is, the set of all points of W where at least one of the coordinate functions given above vanishes:

∂W = ⋃_{σ,q} ( ⋃_{σ′∈Σ} V(x^{σ,q}_{σ′}) ∪ ⋃_{q′∈Q} V(y^{σ,q}_{q′}) ∪ ⋃_{d∈{L,S,R}} V(z^{σ,q}_{d}) )

where V(h) denotes the vanishing locus of h. Lemma D.1. W_0 ≠ W. Proof. Choose x ∈ X with q(x) > 0 and let y be such that q(y|x) = 1. Let w ∈ W_code be the code for the Turing machine which ignores the symbol under the head and the current state, transitions to some fixed state s ≠ y and stays. Then w ∉ W_0. Lemma D.2.
The set W_0 is semi-algebraic and W_0 ⊆ ∂W. Proof. Given x ∈ X with q(x) > 0 we write y = y(x) for the unique state with q(x, y) ≠ 0. In this notation the Kullback-Leibler divergence is

K(w) = Σ_{x∈X} c · D_KL(y ‖ F^x(w)) = −c Σ_{x∈X} log F^x_y(w) = −c log Π_{x∈X} F^x_y(w).

Hence W_0 = W ∩ ⋂_{x∈X} V(1 − F^x_y(w)) is semi-algebraic. Recall that the function ∆step^t is associated to an encoding of the UTM in linear logic by the Sweedler semantics (Clift & Murfet, 2018), and that the particular polynomials involved have a form determined by the details of that encoding (Clift & Murfet, 2018, Proposition 4.3). From the design of our UTM we obtain positive integers l_σ, m_q, n_d for σ ∈ Σ, q ∈ Q, d ∈ {L, S, R} and a function π : Θ → Q where

Θ = Π_{σ,q} Σ^{l_σ} × Q^{m_q} × {L, S, R}^{n_d}.

We represent elements of Θ by tuples (μ, ζ, ξ) ∈ Θ where μ(σ, q, i) ∈ Σ for σ ∈ Σ, q ∈ Q and 1 ≤ i ≤ l_σ, and similarly ζ(σ, q, j) ∈ Q and ξ(σ, q, k) ∈ {L, S, R}. The polynomial F^x_s is

F^x_s = Σ_{(μ,ζ,ξ)∈Θ} δ(s = π(μ, ζ, ξ)) Π_{σ,q} [ Π_{i=1}^{l_σ} x^{σ,q}_{μ(σ,q,i)} Π_{j=1}^{m_q} y^{σ,q}_{ζ(σ,q,j)} Π_{k=1}^{n_d} z^{σ,q}_{ξ(σ,q,k)} ]

where δ is a Kronecker delta. With this in hand we may compute

W_0 = W ∩ ⋂_{x∈X} V(1 − F^x_y(w)) = W ∩ ⋂_{x∈X} ⋂_{s≠y} V(F^x_s(w)).

But F^x_s is a polynomial with non-negative integer coefficients, which takes values in [0, 1] for w ∈ W. Hence it vanishes at w if and only if for each triple (μ, ζ, ξ) with s = π(μ, ζ, ξ) one or more of the coordinate functions x^{σ,q}_{μ(σ,q,i)}, y^{σ,q}_{ζ(σ,q,j)}, z^{σ,q}_{ξ(σ,q,k)} vanishes at w. The desired conclusion follows unless for every x ∈ X and (μ, ζ, ξ) ∈ Θ we have π(μ, ζ, ξ) = y, so that F^x_s = 0 for all s ≠ y. But in this case W_0 = W, which contradicts Lemma D.1. Figure 6: The UTM.
Each of the rectangles is a state, and an arrow q → q′ has the following interpretation: if the UTM is in state q and sees the tape symbols (on the four tapes) indicated at the source of the arrow, then the UTM transitions to state q′, writes the indicated symbols (or, if there is no write instruction, simply rewrites the same symbols back onto the tapes), and performs the indicated movements of each of the tape heads. The symbols a, b, c, d stand for generic symbols which are not X. We can associate to M a discrete dynamical system M = (Σ^Z × Q, step) where step : Σ^Z × Q → Σ^Z × Q is the step function defined by

step(σ, q) = ( α_{δ_3(σ_0,q)}( ..., σ_{−2}, σ_{−1}, δ_1(σ_0, q), σ_1, σ_2, ... ), δ_2(σ_0, q) )

with shift map α_{δ_3(σ_0,q)}(σ)_u = σ_{u+δ_3(σ_0,q)}. Let X be a finite set. The standard X-simplex is defined as

∆X = { Σ_{x∈X} λ_x x ∈ RX | Σ_x λ_x = 1 and λ_x ≥ 0 for all x ∈ X } (8)

where RX is the free vector space on X. We often identify X with the vertices of ∆X under the canonical inclusion i : X → ∆X given by i(x) = Σ_{x′∈X} δ_{x=x′} x′. For example {0, 1} ⊂ ∆({0, 1}) ≅ [0, 1]. A tape square is said to be at relative position u ∈ Z if it is labelled u after enumerating all squares in increasing order from left to right such that the square currently under the head is assigned zero. Consider the following random variables at times t ≥ 0: • Y_{u,t} ∈ Σ: the content of the tape square at relative position u at time t. • S_t ∈ Q: the internal state at time t. • Wr_t ∈ Σ: the symbol to be written, in the transition from time t to t + 1. • Mv_t ∈ {L, S, R}: the direction to move, in the transition from time t to t + 1. We call a pair (A, φ) consisting of a smooth manifold with corners A together with a smooth transformation φ : A → A a smooth dynamical system. Definition F.1. Let M = (Σ, Q, δ) be a Turing machine.
The smooth relaxation of M is the smooth dynamical system ((∆Σ)^Z × ∆Q, ∆step) where ∆step : (∆Σ)^Z × ∆Q → (∆Σ)^Z × ∆Q is the smooth transformation sending a state ({P(Y_{u,t})}_{u∈Z}, P(S_t)) to ({P(Y_{u,t+1})}_{u∈Z}, P(S_{t+1})) determined by the equations

• P(Mv_t = d | C) = Σ_{σ,q} δ_{δ_3(σ,q)=d} P(Y_{0,t} = σ | C) P(S_t = q | C),
• P(Wr_t = σ | C) = Σ_{σ′,q} δ_{δ_1(σ′,q)=σ} P(Y_{0,t} = σ′ | C) P(S_t = q | C),
• P(S_{t+1} = q | C) = Σ_{σ,q′} δ_{δ_2(σ,q′)=q} P(Y_{0,t} = σ | C) P(S_t = q′ | C),
• P(Y_{u,t+1} = σ | C) = P(Mv_t = L | C) [ δ_{u≠1} P(Y_{u−1,t} = σ | C) + δ_{u=1} P(Wr_t = σ | C) ]
  + P(Mv_t = S | C) [ δ_{u≠0} P(Y_{u,t} = σ | C) + δ_{u=0} P(Wr_t = σ | C) ]
  + P(Mv_t = R | C) [ δ_{u≠−1} P(Y_{u+1,t} = σ | C) + δ_{u=−1} P(Wr_t = σ | C) ],

where C ∈ (∆Σ)^Z × ∆Q is the initial state. We call the smooth relaxation of a Turing machine a smooth Turing machine. A smooth Turing machine encodes uncertainty about the initial configuration of a Turing machine together with an update rule for how to propagate this uncertainty over time. We interpret the smooth step function as updating the state of belief of a "naive" Bayesian observer; this nomenclature comes from the assumption of conditional independence between random variables in our probability functions. Remark F.2. Propagating uncertainty using standard probability leads to a smooth dynamical system which encodes the state evolution of an "ordinary" Bayesian observer of the Turing machine. This requires the calculation of various joint distributions, which makes such an extension computationally difficult to work with. Computation aside, the naive probabilistic extension is justified from the point of view of derivatives of algorithms according to the denotational semantics of differential linear logic; see Clift & Murfet (2018) for further details. We call the smooth extension of a universal Turing machine a smooth universal Turing machine. Recall that the staged pseudo-UTM U has four tapes: the description tape, the staging tape, the state tape and the working tape.
The smooth relaxation of U is a smooth dynamical system in which the first factor of X is the configuration of the staging tape. Since U is periodic with period T = 10N + 5 (Appendix E), the iterated function (∆step_U)^T takes an input with staging tape in its default state XXX and UTM state compSymbol and returns a configuration with the same staging tape and state, but with the configuration of the work tape, description tape and state tape updated by one complete simulation step. That is,

(∆step_U)^T (x, w, q, XXX, compSymbol) = (F(x, w, q), XXX, compSymbol)

for some smooth function F : (∆Σ)^Z × W × ∆Q → (∆Σ)^Z × W × ∆Q. (9) Finally we can define the function ∆step^t of (4). We assume all Turing machines are initialised in some common state init ∈ Q. Definition F.3. Given t ≥ 0 we define ∆step^t : Σ* × W → ∆Q by ∆step^t(x, w) = Π_Q F^t(x, w, init), where Π_Q is the projection onto ∆Q.
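Definition F.1 can be implemented directly. The sketch below (our own minimal version, not the paper's PyTorch code) propagates distributions over a bounded tape window for a small detectA-style machine of our own devising; with one-hot inputs the smooth machine reproduces the classical run exactly.

```python
# Smooth relaxation of a Turing machine per Definition F.1: tape squares hold
# distributions over SIGMA, the state is a distribution over Q, and the update
# uses the naive (conditionally independent) propagation rule.

BLANK = "_"
SIGMA = [BLANK, "A", "B"]
Q = ["scan", "accept", "reject"]

def delta(sym, state):
    """delta: (symbol, state) -> (write, new_state, move).  A small
    detectA-style machine (ours): scan right, accept on A, reject on blank."""
    if state == "scan":
        if sym == "A":
            return ("A", "accept", "S")
        if sym == "B":
            return ("B", "scan", "R")
        return (BLANK, "reject", "S")
    return (sym, state, "S")  # accept/reject are absorbing

def one_hot(x, space):
    return {s: float(s == x) for s in space}

def smooth_step(tape, state, lo, hi):
    def cell(u):  # squares outside the window are blank with probability 1
        return tape.get(u, one_hot(BLANK, SIGMA))
    mv = {d: 0.0 for d in "LSR"}
    wr = {s: 0.0 for s in SIGMA}
    new_state = {q: 0.0 for q in Q}
    for sym in SIGMA:                  # marginalise over (Y_0, S_t)
        for q in Q:
            p = cell(0)[sym] * state[q]
            w, q2, d = delta(sym, q)
            wr[w] += p
            new_state[q2] += p
            mv[d] += p
    new_tape = {}
    for u in range(lo, hi + 1):        # the Y_{u,t+1} update of Definition F.1
        new_tape[u] = {
            s: mv["L"] * (wr[s] if u == 1 else cell(u - 1)[s])
             + mv["S"] * (wr[s] if u == 0 else cell(u)[s])
             + mv["R"] * (wr[s] if u == -1 else cell(u + 1)[s])
            for s in SIGMA
        }
    return new_tape, new_state

# With one-hot inputs the smooth machine agrees with the classical one:
tape = {i: one_hot(c, SIGMA) for i, c in enumerate("BBA")}
state = one_hot("scan", Q)
for _ in range(3):
    tape, state = smooth_step(tape, state, -4, 4)
print(state)  # all mass on "accept"
```

Feeding in tape distributions with nonzero entropy instead of one-hot vectors is exactly how the models p(y|x, w) of Section 2 arise, with the uncertainty placed on the description tape of the UTM rather than the work tape.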

G DIRECT SIMULATION

For computational efficiency, in our PyTorch implementation of the staged pseudo-UTM we implement F of (9) rather than ∆step_U. We refer to this as direct simulation, since it means that we update in one step the state and working tape of the UTM for a full cycle, where a cycle consists of T = 10N + 5 steps of the UTM. Let S(t) and Y_u(t) be random variables describing the contents of the state tape and working tape in relative positions 0 and u respectively after t ≥ 0 time steps of the UTM. We define S̃(t) := S(4 + Tt) and Ỹ_u(t) := Y_u(4 + Tt) where t ≥ 0 and u ∈ Z. The task then is to define functions f, g such that S̃(t + 1) = f(S̃(t)) and Ỹ_u(t + 1) = g(Ỹ_u(t)). The functional relationship is given as follows: for 1 ≤ i ≤ N indexing tuples on the description tape, while processing that tuple the UTM is in a state distribution λ_i · q + (1 − λ_i) · ¬q where q ∈ {copySymbol, copyState, copyDir}. Given the initial state of the description tape, we assume uncertainty about s′, q′, d′ only. This determines a map θ : {1, ..., N} → Σ × Q where the description tape at tuple number i is given by θ(i)_1 θ(i)_2 P(s′_i) P(q′_i) P(d′_i). We define the conditionally independent joint distribution between {Ỹ_{0,t−1}, S̃_{t−1}} by

λ_i = Σ_{σ∈Σ} δ_{θ(i)_1=σ} P(Ỹ_{0,t−1} = σ) · Σ_{q∈Q} δ_{θ(i)_2=q} P(S̃_{t−1} = q) = P(Ỹ_{0,t−1} = θ(i)_1) · P(S̃_{t−1} = θ(i)_2).

We then calculate a recursive set of equations for 0 ≤ j ≤ N describing the distributions P(ŝ_j), P(q̂_j) and P(d̂_j) on the staging tape after processing all tuples up to and including tuple j. These are given by P(ŝ_0) = P(q̂_0) = P(d̂_0) = 1 · X and

P(ŝ_i) = Σ_{σ∈Σ} { λ_i · P(s′_i = σ) + (1 − λ_i) · P(ŝ_{i−1} = σ) } · σ + (1 − λ_i) · P(ŝ_{i−1} = X) · X,
P(q̂_i) = Σ_{q∈Q} { λ_i · P(q′_i = q) + (1 − λ_i) · P(q̂_{i−1} = q) } · q + (1 − λ_i) · P(q̂_{i−1} = X) · X,
P(d̂_i) = Σ_{a∈{L,R,S}} { λ_i · P(d′_i = a) + (1 − λ_i) · P(d̂_{i−1} = a) } · a + (1 − λ_i) · P(d̂_{i−1} = X) · X.

Let A_σ = P(ŝ_N = X) · P(Ỹ_{0,t−1} = σ) + P(ŝ_N = σ).
In terms of the above distributions,

P(S̃_t) = Σ_{q∈Q} [ P(q̂_N = X) · P(S̃_{t−1} = q) + P(q̂_N = q) ] · q.



The space W of parameters is clearly semi-analytic, that is, it is cut out of R^d for some d by the vanishing f_1(x) = ··· = f_r(x) = 0 of finitely many analytic functions on open subsets of R^d, together with finitely many inequalities g_1(x) ≥ 0, ..., g_s(x) ≥ 0 where the g_j(x) are analytic. In fact W is semi-algebraic, since the f_i and g_j may all be chosen to be polynomial functions. Note that this sampling procedure is repeated every time the UTM looks at a given entry.



Figure 1: The state of U is represented by the state of the work tape, state tape and description (code) tape. The work tape is initialised with a sequence x ∈ Σ*, the code tape with w ∈ W and the state tape with some standard initial state; the smooth relaxation ∆step of the pseudo-UTM is run for t steps and the final probability distribution over states is y.

Figure 2: Visualisation of two solutions for the synthesis problem detectA.


Figure 3: Level sets above the cutoff cannot contain siren local minima of the free energy.

Our purpose in these experiments is not to provide high-accuracy estimates of the RLCT, as these would require much longer Markov chains. Instead we demonstrate how rough estimates consistent with the theory can be obtained at low computational cost. If this model were regular the RLCT would be dim W/2 = 15.

Figure 4: Plot of RLCT estimates for detectA. Shaded region shows one standard deviation.

The deterministic synthesis problem parityCheck has Σ = {□, A, B, X} and Q = {reject, accept, getNextAB, getNextA, getNextB, gotoStart}.

∆step_U : [(∆Σ_UTM)^{Z,□}]^4 × ∆Q_UTM → [(∆Σ_UTM)^{Z,□}]^4 × ∆Q_UTM.

If we use the staged pseudo-UTM to simulate a Turing machine with tape alphabet Σ ⊆ Σ_UTM and states Q ⊆ Σ_UTM, then with some determined initial state the function ∆step_U restricts to

∆step_U : (∆Σ)^{Z,□} × W × ∆Q × X → (∆Σ)^{Z,□} × W × ∆Q × X

where the first factor is the configuration of the work tape, W is as in (3), and X = [(∆Σ_UTM)^{Z,□}] × ∆Q_UTM.

RLCT estimates for detectA.

If it's an A (resp. B) we enter state getNextB (resp. getNextA). If no A or B is found, we enter the state accept. The state getNextA (resp. getNextB) moves right until an A (resp. B) is found, overwrites it with an X and enters state gotoStart, which moves left until a blank symbol is found (resetting the machine to the left end of the tape). If no A's (resp. B's) were left on the tape, we enter state reject. The dimension of the parameter space is dim W = 240. If this model were regular, the RLCT would be dim W/2 = 120. Our RLCT estimates are contained in Table 2.

RLCT estimates for parityCheck.
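The intended behaviour of parityCheck can be pinned down with a short discrete simulation. This is our own reconstruction from the prose above (in particular we assume the first matched A or B is also overwritten with an X), not code from the paper:

```python
def parity_check(s):
    """Discrete semantics of the parityCheck machine: repeatedly cross off
    the leftmost unmatched A or B, then a partner of the opposite letter to
    its right; accept iff every symbol can be paired off."""
    tape = list(s)
    while True:
        # getNextAB: scan right for the first A or B
        i = 0
        while i < len(tape) and tape[i] not in "AB":
            i += 1
        if i == len(tape):
            return True                      # accept: nothing left unmatched
        first = tape[i]
        tape[i] = "X"
        partner = "B" if first == "A" else "A"
        # getNextA / getNextB: scan right for the partner symbol
        j = i + 1
        while j < len(tape) and tape[j] != partner:
            j += 1
        if j == len(tape):
            return False                     # reject: no partner remains
        tape[j] = "X"
        # gotoStart: return to the left end and rescan
```

On this reading the machine accepts exactly the strings with equally many A's and B's, e.g. `parity_check("AABB")` accepts while `parity_check("AAB")` rejects.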

Hyperparameters for Datasets and MCMC.

APPENDIX A ALGORITHM FOR ESTIMATING RLCTS

Given a sample D_n = {(x_i, y_i)}_{i=1}^n from q(x, y) let L_n(w) := -(1/n) Σ_{i=1}^n log p(y_i|x_i, w) be the negative log likelihood. We would like to estimate the posterior expectation E^β_w[nL_n(w)], taken with respect to the tempered posterior with normalising constant Z^β_n = ∫ ϕ(w) Π_{i=1}^n p(y_i|x_i, w)^β dw, for some inverse temperature β. If β = β_0/log n for some constant β_0, then by Theorem 4 of Watanabe (2013),

E^β_w[nL_n(w)] = nL_n(w_0) + λ/β + U_n √(λ/(2β)) + O_p(1),

where w_0 is a true parameter, {U_n} is a sequence of random variables satisfying E[U_n] = 0 and λ is the RLCT. In practice, the last two terms often vary negligibly with 1/β, and so E^β_w[nL_n(w)] approximates a linear function of 1/β with slope λ (Watanabe, 2013, Corollary 3). This is the foundation of the RLCT estimation procedure in Algorithm 1, which is used in our experiments.

This model is derived by propagating uncertainty through shiftMachine in the same way that p(y|x, w) is derived from ∆step_t in Section 2 by propagating uncertainty through U. We assume that some distribution q(x) over {A, B}^2 is given.

Example C.1. Suppose q(y|x) = p(y|x, w_0) where w_0 = (1, 1). It is easy to see that the set W_0 of true parameters is a semi-algebraic variety, that is, it is defined by polynomial equations and inequalities. Here V(h) denotes the vanishing locus of a function h.

Example C.2. Suppose q(AB) = 1 and q(y|x = AB) is given in terms of a function f. Note that f has no critical points, and so the level set f = 1/2 is regular, while the curve 4f(1 − f) = 1 is singular; it is the geometry of the singular curve that is related to the behaviour of K. This curve is shown in Figure 5. It is straightforward to check that the determinant of the Hessian of K is identically zero on W_0, so that every point on W_0 is a degenerate critical point of K.
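Since E^β_w[nL_n(w)] is approximately linear in 1/β with slope λ, the RLCT can be read off by running MCMC at several inverse temperatures and fitting a least-squares line. A minimal sketch of the regression step (the MCMC estimates here are synthetic stand-ins, and the function name is ours, not from Algorithm 1):

```python
import numpy as np

def estimate_rlct(inv_temps, expected_nll):
    """Fit E^β_w[n L_n(w)] ≈ a + λ·(1/β) by least squares.

    inv_temps    : array of 1/β values (β = β_0/log n for several β_0)
    expected_nll : MCMC estimates of E^β_w[n L_n(w)] at each temperature
    Returns the fitted slope, i.e. the RLCT estimate λ̂.
    """
    slope, _intercept = np.polyfit(inv_temps, expected_nll, deg=1)
    return slope

# Synthetic check: data generated with slope λ = 1.5 and
# intercept n L_n(w_0) = 100 should recover λ̂ ≈ 1.5.
one_over_beta = np.array([2.0, 3.0, 4.0, 5.0])
e_nll = 100.0 + 1.5 * one_over_beta
lam = estimate_rlct(one_over_beta, e_nll)
```

In practice the quality of λ̂ is limited by the quality of the per-temperature MCMC estimates, which is why long chains are needed for high-accuracy RLCTs.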

D GENERAL SOLUTION FOR DETERMINISTIC SYNTHESIS PROBLEMS

In this section we consider the case of a deterministic synthesis problem q(x, y) which is finitely supported, in the sense that there exists a finite set X ⊆ Σ* such that q(x) = c for all x ∈ X and q(x) = 0 for all x ∉ X. We first need to discuss the coordinates on the parameter space W of (3). To specify a point on W is to specify, for each pair (σ, q) ∈ Σ × Q (that is, for each tuple on the description tape), a triple of probability distributions over the symbol to write, the next state and the direction to move.
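This coordinate count can be checked against dim W = 240 reported for parityCheck: each of the |Σ|·|Q| = 24 tuples contributes one free coordinate per non-trivial simplex direction. A minimal sketch (the direction symbols {L, S, R} and all names are our assumptions):

```python
SIGMA = ["_", "A", "B", "X"]   # parityCheck tape alphabet ("_" stands for blank)
Q = ["reject", "accept", "getNextAB", "getNextA", "getNextB", "gotoStart"]
DIRS = ["L", "S", "R"]         # assumed direction symbols

def dim_W(sigma, q, dirs):
    """Free coordinates of W: each (symbol, state) pair carries a
    distribution over sigma, one over q and one over dirs; a simplex
    on k points has k - 1 free coordinates."""
    per_tuple = (len(sigma) - 1) + (len(q) - 1) + (len(dirs) - 1)
    return len(sigma) * len(q) * per_tuple

print(dim_W(SIGMA, Q, DIRS))  # 4 * 6 * (3 + 5 + 2) = 240
```

The arithmetic 24 × 10 = 240 matches dim W/2 = 120 for the regular comparison above.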

E STAGED PSEUDO-UTM

Simulating a Turing machine M with tape alphabet Σ and set of states Q on a standard UTM requires the specification of an encoding of Σ and Q in the tape alphabet of the UTM. From the point of view of exploring the geometry of program synthesis, this additional complexity is uninteresting, and so here we consider a staged pseudo-UTM whose alphabet Σ_UTM is the disjoint union of Σ, Q, the direction symbols and the special symbols X and □, where □ is the blank symbol (which is distinct from the blank symbol of M). Such a machine is capable of simulating any machine with tape alphabet Σ and set of states Q, but cannot simulate arbitrary machines and is not a UTM in the standard sense. The adjective staged refers to the design of the UTM, which we now explain.

The set of states is Q_UTM = { compSymbol, compState, copySymbol, copyState, copyDir, ¬compState, ¬copySymbol, ¬copyState, ¬copyDir, updateSymbol, updateState, updateDir, resetDescr }. The UTM has four tapes numbered from 0 to 3, which we refer to as the description tape, the staging tape, the state tape and the working tape respectively. Initially the description tape contains a string of the form

X s_0 q_0 s'_0 q'_0 d_0 s_1 q_1 s'_1 q'_1 d_1 . . . s_N q_N s'_N q'_N d_N X,

corresponding to the tuples which define M, with the tape head initially on s_0. The staging tape initially contains the string XXX with the tape head over the second X. The state tape has a single square containing some distribution in ∆Q, corresponding to the initial state of the simulated machine M, with the tape head over that square. Each square on the working tape is some distribution in ∆Σ, with only finitely many distributions different from □. The UTM is initialised in state compSymbol.

The operation of the UTM is outlined in Figure 6. It consists of two phases: the scan phase (middle and right path), and the update phase (left path).
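The description-tape layout can be sketched as a short helper (our own illustrative code; cell values are placeholder strings):

```python
def description_tape(tuples):
    """Lay out the tuples (s, q, s', q', d) defining M on the description
    tape in the format  X s0 q0 s0' q0' d0 ... sN qN sN' qN' dN X."""
    cells = ["X"]
    for (s, q, s_new, q_new, d) in tuples:
        cells += [s, q, s_new, q_new, d]
    cells.append("X")
    return cells

# e.g. a one-tuple machine: read s in state q, write s', enter q', move right
tape = description_tape([("s", "q", "s'", "q'", "R")])
```

With N + 1 tuples the tape has 5(N + 1) + 2 cells, the two delimiting X's bracketing five symbols per tuple.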
During the scan phase, the description tape is scanned from left to right, and the first two squares of each tuple are compared to the contents of the working tape and state tape respectively. If both agree, then the last three symbols of the tuple are written to the staging tape (middle path); otherwise the tuple is ignored (right path). Once the X at the end of the description tape is reached, the UTM begins the update phase, wherein the three symbols on the staging tape are used to print the new symbol on the working tape, to update the simulated state on the state tape, and to move the working tape head in the appropriate direction. The tape head on the description tape is then reset to the initial X.

Remark E.1. One could imagine a variant of the UTM which did not include a staging tape, instead performing the actions on the work and state tape directly upon reading the appropriate tuple on the description tape. However, this is problematic when the contents of the state or working tape are distributions, as the exact time-step of the simulated machine can become unsynchronised, increasing entropy. As a simple example, suppose that the contents of the state tape were 0.5q + 0.5p, and the symbol under the working tape head was s. Upon encountering the tuple s q s' q' R, the machine would enter a superposition of states corresponding to the tape head having both moved right and not moved, complicating the future behaviour.

We define the period of the UTM to be the smallest nonzero time interval taken for the tape head on the description tape to return to the initial X and the machine to reenter the state compSymbol. If the number of tuples on the description tape is N, then the period of the UTM is T = 10N + 5. Moreover, other than the working tape, the positions of the tape heads are T-periodic.
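The scan phase has a simple discrete (non-relaxed) semantics, which we sketch below; with distributions on the tapes, the relaxed UTM instead weights each tuple by how well its first two entries match. All names here are ours:

```python
def scan_phase(tuples, work_symbol, state):
    """Discrete scan phase: walk the description tape left to right and
    copy the last three entries of any matching tuple (s, q, s', q', d)
    to the staging tape.  For a deterministic machine at most one tuple
    matches; if none does, the staging tape keeps its initial XXX."""
    staging = ["X", "X", "X"]
    for (s, q, s_new, q_new, d) in tuples:
        if s == work_symbol and q == state:   # compSymbol / compState agree
            staging = [s_new, q_new, d]       # copySymbol / copyState / copyDir
    return staging

staging = scan_phase([("A", "q0", "B", "q1", "R")], "A", "q0")
```

The update phase then consumes this triple: write `staging[0]`, enter `staging[1]`, move in direction `staging[2]`.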

F SMOOTH TURING MACHINES

Let U be the staged pseudo-UTM of Appendix E. In defining the model p(y|x, w) associated to a synthesis problem in Section 2 we use a smooth relaxation ∆step_t of the step function of U. In this appendix we define the smooth relaxation of any Turing machine, following Clift & Murfet (2018). Let M = (Σ, Q, δ) be a Turing machine with a finite set of symbols Σ, a finite set of states Q and transition function δ : Σ × Q → Σ × Q × {L, S, R}.

Using these equations, we can state efficient update rules for the staging tape, in which the l-th tuple enters with weight λ_l and is discounted by factors of (1 − λ_l). To enable efficient computation, we can express these equations using tensor calculus. Let λ = (λ_1, . . . , λ_N) ∈ R^N. We view θ : R^N → RΣ ⊗ RQ as a tensor, and so θ can be expressed in terms of the vector λ only, via the factors (1 − λ_1), . . . , (1 − λ_N), 1. Similarly, P(q* = •) ∈ R^N ⊗ RQ, with P(q_N) expressed via (1 − λ_1), . . . , (1 − λ_N), 1.
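As an illustration of propagating uncertainty through a transition function, the following sketch pushes marginal distributions over the current symbol and state forward through δ. This is only a naive marginal push-forward for a single step, not the staged-UTM relaxation defined above, and all names are our assumptions:

```python
import numpy as np

def smooth_step(delta, symbols, states, dirs, p_sym, p_state):
    """Push independent distributions over (symbol, state) forward through
    delta: (s, q) -> (s', q', d).  Returns the induced marginal
    distributions over the written symbol, next state and direction."""
    p_write = np.zeros(len(symbols))
    p_next = np.zeros(len(states))
    p_dir = np.zeros(len(dirs))
    for i, s in enumerate(symbols):
        for j, q in enumerate(states):
            w = p_sym[i] * p_state[j]        # probability of reading (s, q)
            s2, q2, d = delta[(s, q)]
            p_write[symbols.index(s2)] += w
            p_next[states.index(q2)] += w
            p_dir[dirs.index(d)] += w
    return p_write, p_next, p_dir

# Toy machine: uncertain symbol ("_" or "A"), certain state q0.
symbols, states, dirs = ["_", "A"], ["q0", "q1"], ["L", "S", "R"]
delta = {("_", "q0"): ("A", "q1", "R"),
         ("A", "q0"): ("A", "q0", "S"),
         ("_", "q1"): ("_", "q0", "L"),
         ("A", "q1"): ("_", "q1", "R")}
p_write, p_next, p_dir = smooth_step(delta, symbols, states, dirs,
                                     np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```

Passing only marginals between steps is exactly what produces the entropy increase discussed in Remark E.1, since correlations between the tapes are discarded.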

