SOFTENED SYMBOL GROUNDING FOR NEURO-SYMBOLIC SYSTEMS

Abstract

Neuro-symbolic learning generally consists of two separated worlds, i.e., neural network training and symbolic constraint solving, whose success hinges on symbol grounding, a fundamental problem in AI. This paper presents a novel, softened symbol grounding process, bridging the gap between the two worlds, and resulting in an effective and efficient neuro-symbolic learning framework. Technically, the framework features (1) modeling of symbol solution states as a Boltzmann distribution, which avoids expensive state searching and facilitates mutually beneficial interactions between network training and symbolic reasoning; (2) a new MCMC technique leveraging projection and SMT solvers, which efficiently samples from disconnected symbol solution spaces; (3) an annealing mechanism that can escape from sub-optimal symbol groundings. Experiments with three representative neuro-symbolic learning tasks demonstrate that, owing to its superior symbol grounding capability, our framework successfully solves problems well beyond the frontier of the existing proposals.

1. INTRODUCTION

Neuro-symbolic systems have been proposed to connect neural network learning and symbolic constraint satisfaction (Garcez et al., 2019; Marra et al., 2021; Yu et al., 2021; Hitzler, 2022). In these systems, the neural network component first recognizes the raw input as a symbol, which is then fed into the symbolic component to produce the final output (Yi et al., 2018; Li et al., 2020; Liang et al., 2017). Such a neuro-symbolic paradigm has shown unprecedented capability and achieved impressive results in many tasks, including visual question answering (Yi et al., 2018; Vedantam et al., 2019; Amizadeh et al., 2020), vision-language navigation (Anderson et al., 2018; Fried et al., 2018), and math word problem solving (Hong et al., 2021; Qin et al., 2021), to name a few.

As exemplified by Figure 1, to maximize generalizability, such problems are usually cast in a weakly-supervised setting (Garcez et al., 2022): the final output of the neuro-symbolic computation is provided as supervision during training, rather than the labels of the intermediate symbols. The lack of direct supervised labels for network training calls for an effective and efficient approach to the symbol grounding problem, i.e., establishing a feasible and generalizable mapping from the raw inputs to the latent symbols. Note that bypassing symbol grounding (by, e.g., regarding the problem as learning with logic constraints) is possible, but cannot achieve a satisfactory performance (Manhaeve et al., 2018; Xu et al., 2018; Pryor et al., 2022). Existing methods incorporating symbol grounding in network learning rely heavily on a good initial model and perform poorly when starting from scratch (Dai et al., 2019; Li et al., 2020; Huang et al., 2021).

A key challenge of symbol grounding lies in the semantic gap between neural learning, which is stochastic and continuous, and symbolic reasoning, which is deterministic and discrete. To bridge the gap, we propose to soften the symbol grounding.
That is, instead of directly searching for a deterministic input-symbol mapping, we optimize a Boltzmann distribution over such mappings, with an annealing strategy that gradually converges to a deterministic one. Intuitively, the softened Boltzmann distribution provides a playground where the search for input-symbol mappings can be guided by the neural network, and the network training can be supervised by sampling from the distribution. Game theory provides theoretical support for this strategy (Conitzer, 2016): the softening turns the learning process into a series of mixed-strategy games during annealing, which encourages stronger interactions between the neural and symbolic worlds.

The remaining challenge is how to efficiently sample the feasible input-symbol mappings. Specifically, feasible solutions are extremely sparse in the entire symbol space and different solutions are poorly connected, which prevents Markov Chain Monte Carlo (MCMC) sampling from efficiently exploring the solution space. To overcome this deficiency, we leverage the projection technique to accelerate the random walk for sampling (Feng et al., 2021b), aided by satisfiability modulo theories (SMT) solvers (Nieuwenhuis et al., 2006; Moura & Bjørner, 2008). The intuition is that disconnected solutions in a high-dimensional space may become connected when they are projected onto a low-dimensional space, resulting in rapid mixing of the MCMC sampling (Feng et al., 2021a). The SMT solver, which is called on demand, is used as a generic approach to compute the inverse projection. Although MCMC sampling and SMT solvers may introduce bias, our theoretical result confirms that this bias can be offset by the proposed stochastic gradient descent algorithm.

2. SOFTENING SYMBOL GROUNDING

Throughout this paper, we refer to $\mathcal{X}$ as the input space of the neuro-symbolic system, and $\mathcal{Z}$ as its symbol space or state space (e.g., all legal and illegal arithmetic expressions in the handwritten formula (HWF) task). We consider the neuro-symbolic computing task which first trains a neural network (parameterized by $\theta$), mapping a raw input $x \in \mathcal{X}$ to some latent state $z \in \mathcal{Z}$ with a (variational) probability distribution $P_\theta(z \mid x)$. The state $z$ is further fed into a predefined symbolic reasoning procedure to produce the final output $y$. The training data contains only the inputs $x$ and the corresponding outputs $y$, which casts the problem into the so-called weakly-supervised setting.

In general, we formulate the predefined symbolic reasoning procedure and the output $y$ as a symbolic constraint set $S_y$ on the symbol space. For instance, in Figure 1, the constraint specifies that the arithmetic expression must evaluate to 42. We say a state $z$ is feasible, or satisfies the symbolic constraint, if $z \in S_y$.

The major challenge in this neuro-symbolic learning paradigm lies in the symbol grounding problem, i.e., establishing a mapping $h : \mathcal{X} \to \mathcal{Z}$ from the raw input to a feasible state that satisfies the symbolic constraint. Specifically, an effective mapping $h$ should enable the model to explain as many observations as possible. As a result, the symbol grounding problem on a given dataset $\{(x_i, y_i)\}_{i=1,\dots,N}$ can be formulated as

$$\min_{h}\,\min_{\theta}\ \ell(\theta) := -\sum_{i=1}^{N} \log P_\theta(z_i \mid x_i) \quad \text{s.t.} \quad z_i = h(x_i) \in S_{y_i},\ i = 1, \dots, N. \tag{1}$$

A straightforward solution to the above formulation would first train a network for each feasible mapping, and then select the one that achieves the final output $y$ with maximum likelihood. However, this solution is impractical since the number of feasible mappings grows exponentially. An obvious shortcoming of the above solution is that the neural network learning process makes no use of the knowledge embodied in the symbolic constraint.
Conversely, the search for the best mapping is not guided by the network. To overcome this shortcoming, one can switch the minimization order in problem (1), and obtain a new but numerically equivalent problem:

$$\min_{\theta}\,\min_{h}\ \ell(\theta) := -\sum_{i=1}^{N} \log P_\theta(z_i \mid x_i) \quad \text{s.t.} \quad z_i = h(x_i) \in S_{y_i},\ i = 1, \dots, N. \tag{2}$$

The optimization problem (2) first determines a "best" mapping based on the initial model, and then updates the model to fit this mapping. The two steps are iterated until no improvement can be made. However, this grounding strategy may easily get trapped in a local optimum. The reason is that, every time a feasible mapping $h$ is achieved, $h$ tends to direct the neural network to (over)fit itself. Because the mappings are deterministic and discrete, there is no smooth route to alternative feasible mappings that would further improve the network. This insufficient information exchange between network training and symbolic reasoning makes the success of symbol grounding highly dependent on the quality of the initial model.

In this work, we propose to soften the symbol grounding to facilitate the interaction between neural perception and symbolic reasoning. Instead of directly searching for a deterministic mapping $h$, we first pursue an optimal probability distribution of $h$, and then gradually "sharpen" the distribution to obtain the final deterministic $h$. Formally, for each input $x$, we introduce a Boltzmann distribution $Q_\phi$ over $S_y$, parameterized by $\phi$, to indicate the probability of each feasible state that satisfies the symbolic constraint. Then, the softened symbol grounding problem can be formulated as follows:

$$\min_{\theta,\phi}\ \ell(\theta,\phi) := -\sum_{i=1}^{N}\sum_{z_i \in S_{y_i}} Q^i_\phi(z_i) \log P_\theta(z_i \mid x_i) + \gamma \sum_{i=1}^{N}\sum_{z_i \in S_{y_i}} Q^i_\phi(z_i) \log Q^i_\phi(z_i) \quad \text{s.t.} \quad \mathrm{supp}(Q^i_\phi) \subseteq S_{y_i},\ i = 1, \dots, N, \tag{P}$$

where $\mathrm{supp}(Q_\phi)$ denotes the support of $Q_\phi$. The entropy term $\sum_{z \in S_y} Q_\phi(z) \log Q_\phi(z)$ is introduced to control the sharpness of $Q_\phi$ (MacKay et al., 2003), and decreasing its coefficient $\gamma$ ensures that the grounding can converge to a deterministic mapping $h$. Besides the case $\gamma = 1$, which yields the KL divergence between $P_\theta$ and $Q_\phi$, we also examine two extreme cases: (1) when $\gamma \to +\infty$, $Q_\phi$ is forced towards the uniform distribution, and thus the minimization only aims to restrict the support of $P_\theta$ to $S_y$; (2) when $\gamma \to 0$, $Q_\phi$ is confined to a one-hot categorical distribution, and the problem reduces to a direct search for the deterministic mapping $h$.

Advantages. Game theory provides a perspective to understand why our softened strategy improves over problems (1) and (2) with better interaction between model training and symbolic reasoning. Either problem (1) or (2) can be viewed as a pure-strategy Stackelberg game. That is, both the model training and the symbolic reasoning are forced to take a certain action (e.g., selecting a deterministic mapping $h$) during optimization. In contrast, problem (P) can be seen as a Stackelberg game with mixed strategies, where the player takes a randomized action with the distribution $Q_\phi$. Compared with a pure strategy, a mixed strategy provides more information about the game, and thus strictly improves the utility of model training (Letchford et al., 2014; Conitzer, 2016).

In addition, this softening technique also avoids enumeration and thus improves efficiency. In problem (1), for each input $x$, the minimization over its corresponding symbol $z$ requires searching over the whole $S_y$ (i.e., enumerating all feasible states satisfying the symbolic constraint). This is generally an intractable #P-complete problem in theory (Arora & Barak, 2009). Problem (P) circumvents this costly computation by estimating the expectation over $Q_\phi$, which can be efficiently computed with a tailored sampling strategy discussed in the next section.
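To make the enumeration cost and the sparsity of feasible states concrete, the following minimal sketch enumerates $S_y$ by brute force for short arithmetic formulas in the style of the HWF example. It is an illustration under stated assumptions, not part of the proposed framework: the helper names `feasible_states` and `sparsity`, the ASCII operators `*` and `/` standing in for × and ÷, and the restriction to odd formula lengths are all introduced here for illustration.

```python
from itertools import product

DIGITS = "123456789"   # digit symbols, as in the HWF example
OPS = "+-*/"           # ASCII stand-ins for +, -, ×, ÷ (an assumption of this sketch)


def feasible_states(length, target, tol=1e-9):
    """Enumerate S_y by brute force.

    Returns every well-formed formula with `length` symbols (odd length,
    digits and operators alternating) that evaluates to `target`.
    """
    slots = [DIGITS if i % 2 == 0 else OPS for i in range(length)]
    found = []
    for symbols in product(*slots):
        formula = "".join(symbols)
        # Digits are 1-9, so division by zero cannot occur.
        if abs(eval(formula) - target) < tol:
            found.append(formula)
    return found


def sparsity(length, target):
    """Return (|S_y|, |Z|), where Z holds all strings over the 13 symbols."""
    total = (len(DIGITS) + len(OPS)) ** length
    return len(feasible_states(length, target)), total
```

For example, `sparsity(3, 42)` compares the two feasible formulas `6*7` and `7*6` against 13³ = 2197 candidate states. The candidate space grows as 13 to the power of the formula length, which is why exhaustive search quickly becomes impractical.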

3. MARKOV CHAIN MONTE CARLO ESTIMATE VIA PROJECTION

To simplify presentation, in this section, we consider a single data sample. Specifically, by removing the summation over all samples and dropping the superscripts, problem (P) can be formulated as

$$\min_{\theta,\phi}\ \ell(\theta,\phi) := -\sum_{z \in S_y} Q_\phi(z) \log P_\theta(z \mid x) + \gamma \sum_{z \in S_y} Q_\phi(z) \log Q_\phi(z) \quad \text{s.t.} \quad \mathrm{supp}(Q_\phi) \subseteq S_y.$$

This problem can be solved by alternating between a gradient descent step in $\theta$ and a minimization step in $\phi$. The updates of $\theta$ and $\phi$ at the $k$-th iteration are

$$\theta_{k+1} = \theta_k - \eta \nabla_\theta \ell(\theta_k, \phi_k), \qquad \phi_{k+1} = \arg\min_{\phi:\ \mathrm{supp}(Q_\phi) \subseteq S_y} \ell(\theta_{k+1}, \phi).$$

Note that a closed-form solution $Q_{\phi^*}$ exists when $P_\theta$ is fixed, ensuring the convergence of gradient descent in $\theta$ (Jin et al., 2020, Theorem 31). In detail, the lower-level problem, i.e., $\min_\phi \ell(\theta, \phi)$, is strictly convex, and thus has a unique minimizer:

$$Q_{\phi^*}(z) = \begin{cases} \dfrac{P_\theta(z \mid x)^{1/\gamma}}{\sum_{z' \in S_y} P_\theta(z' \mid x)^{1/\gamma}}, & \text{if } z \in S_y, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

Given the closed-form solution $Q_{\phi^*}$, the loss function $\ell(\theta, \phi^*)$ and its gradient $\nabla_\theta \ell(\theta, \phi^*)$ can be estimated through Monte Carlo sampling on $Q_{\phi^*}$.

The remaining problem is how to sample from $Q_{\phi^*}$, which is challenging due to the unknown structure of $S_y$. Existing methods usually sample from the entire symbol/state space $\mathcal{Z}$, and then either reject the states $z \notin S_y$ (e.g., the policy-gradient method (Williams, 1992)), or project the infeasible state $z$ to $S_y$ (e.g., the back-search method (Li et al., 2020)). Unfortunately, these methods suffer from the sparsity problem, i.e., feasible states are very sparse in $\mathcal{Z}$, causing the policy gradient to vanish and the back-search to fail.

To overcome the sparsity problem, we propose to sample directly from the symbolic constraint $S_y$ (i.e., the solution space). By applying the Metropolis algorithm (Bhanot, 1988; Beichl & Sullivan, 2000), the acceptance ratio of jumping from one feasible state $z$ to another one $z'$ does not vanish, and can be computed as

$$\tau = \frac{Q_{\phi^*}(z')}{Q_{\phi^*}(z)} = \left(\frac{P_\theta(z' \mid x)}{P_\theta(z \mid x)}\right)^{1/\gamma}.$$
Hence the problem becomes: (1) how to generate an initial state $z$, and (2) how to jump from $z$ to $z'$. For the former, a natural way is to leverage SMT solvers (Moura & Bjørner, 2008).¹ For the latter, the most commonly used strategy is to reach the new state via a random walk (Sherlock et al., 2010). However, there is no systematic random walk approach in the solution space, because the solution space is likely disconnected (Wigderson, 2019), creating the so-called connectivity barrier.

Inspired by Feng et al. (2021a), we propose to overcome the connectivity barrier by the projection technique. Specifically, we introduce a projection operator $\Pi(\cdot) : \mathcal{Z} \to \Omega$ that maps the state space $\mathcal{Z}$ to a lower-dimensional space $\Omega$, and then apply the single-site Metropolis algorithm in $\Omega$. The projection essentially compacts the state space, and thus significantly improves the connectivity of the solution space.

Figure 2 illustrates the key idea. Consider the running example in Figure 1, where $S_y$ requires that all expressions evaluate to 42. The SMT solver, together with the standard single-site Metropolis (a.k.a. Metropolis-in-Gibbs) (Metropolis et al., 1953; Bai, 2009), can easily derive an initial state (e.g., 4×9+3+3=42) satisfying the symbolic constraint, but cannot further explore other feasible states due to the connectivity barrier (Ermon et al., 2012).² In contrast, in the lower-dimensional space $\Omega$, it is much easier to jump to another feasible state.
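The projection-based sampler described in this section can be sketched in a few lines for the running example. This is a minimal sketch under stated assumptions rather than the authors' implementation: the inverse projection is computed by brute force over the 81 digit pairs instead of an SMT solver, the network output $P_\theta(\cdot \mid x)$ is replaced by a toy per-position distribution (`toy_probs`), positions are treated as independent, and all function names are hypothetical. Choosing a completion uniformly at random when several exist is also an assumption.

```python
import math
import random

DIGITS = "123456789"
OPS = "+-*/"  # ASCII stand-ins for +, -, ×, ÷ (an assumption of this sketch)


def evaluates_to(formula, target, tol=1e-9):
    """Symbolic constraint: the formula must evaluate to `target`."""
    return abs(eval(formula) - target) < tol


def project(z):
    """Projection of Figure 2: drop the first and the last digit."""
    return z[1:-1]


def inverse_project(omega, target, rng):
    """Brute-force stand-in for the SMT call (hypothetical helper).

    Returns a feasible state whose projection is `omega`, or None.
    """
    completions = [a + omega + b for a in DIGITS for b in DIGITS
                   if evaluates_to(a + omega + b, target)]
    return rng.choice(completions) if completions else None


def toy_probs(reference, peak=0.5):
    """Toy stand-in for the network output: per-position categorical
    distributions peaked at the symbols of `reference`."""
    probs = []
    for s in reference:
        alphabet = DIGITS if s in DIGITS else OPS
        rest = (1.0 - peak) / (len(alphabet) - 1)
        probs.append({c: (peak if c == s else rest) for c in alphabet})
    return probs


def log_prob(z, probs):
    """log P(z|x), assuming independent positions."""
    return sum(math.log(p[s]) for p, s in zip(probs, z))


def metropolis_step(z, target, probs, gamma, rng):
    """One single-site Metropolis move in the projected space."""
    omega = list(project(z))
    i = rng.randrange(len(omega))
    alphabet = DIGITS if omega[i] in DIGITS else OPS
    omega[i] = rng.choice(alphabet)            # random walk in the projected space
    z_new = inverse_project("".join(omega), target, rng)
    if z_new is None:                          # no feasible pre-image: stay put
        return z
    # Acceptance ratio tau = (P(z'|x) / P(z|x))^(1/gamma), in log space.
    log_tau = (log_prob(z_new, probs) - log_prob(z, probs)) / gamma
    return z_new if rng.random() < math.exp(min(0.0, log_tau)) else z


def run_chain(z0, target, probs, gamma, steps, seed=0):
    """Run the sampler and return the list of visited states."""
    rng = random.Random(seed)
    chain = [z0]
    for _ in range(steps):
        chain.append(metropolis_step(chain[-1], target, probs, gamma, rng))
    return chain
```

Starting from `"4*9+3+3"` with target 42, every state returned by `run_chain` satisfies the constraint by construction, because infeasible proposals are never completed. A plain single-site walk in the full seven-symbol space would instead be stuck at the initial state.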

4. ALGORITHM AND ANALYSIS

The overall algorithm of our neuro-symbolic learning is shown in Alg. 1.

Convergence. In the ideal case, the gradient estimate $\nabla\ell(\theta)$ is unbiased, and the gradient descent in $\theta$ (i.e., the network parameters) converges. However, bias is introduced by (1) the approximate sampling of the Metropolis algorithm (Jacob et al., 2017), and (2) the inverse projection implemented by the SMT solver (Moura & Bjørner, 2008). For the former, one can increase the number of inner iterations in our algorithm or consider adaptive variants of the Metropolis algorithm. For the latter, one can alter the projection operator during the training process, or increase the dimension of the projection space. Nevertheless, none of these methods can fully avoid the bias of the gradient estimate. We therefore provide a convergence result for stochastic gradient descent with limited bias.

Proposition 1. Assume the loss function $\ell(\theta)$ is $L$-Lipschitz and $\ell$-smooth, and let the actual sampling distribution be $Q$. Then, if the total variation distance $d_{tv}(Q, Q^*)$ is bounded by $\epsilon$, it holds after $K$ steps of stochastic gradient descent with learning rate $\eta = \alpha/\sqrt{T+1}$:

$$\frac{1}{K}\sum_{k=1}^{K} \|\nabla\ell(\theta_k)\|^2 \le O\!\left(\frac{\ell\sigma^2 + \Delta_0}{\alpha\sqrt{T+1}} + (n\epsilon L)^2\right),$$

where $\Delta_0 = \ell(\theta_0) - \min \ell(\theta)$, and $n$ is the cardinality of $\mathrm{supp}(Q^*)$.

Remarks. A proof is given in Appendix A.1. This proposition states that stochastic gradient descent with the MCMC gradient estimate converges to an approximate stationary point. Moreover, the bias term gradually vanishes during training, since decreasing $\gamma$ shrinks the support of $Q^*$, making the gradient estimate eventually align with the true one.

Generalization of existing methods. Existing neuro-symbolic learning frameworks, viz. semantic loss (Xu et al., 2018), deepproblog (Manhaeve et al., 2018), and neural-grammar-symbolic learning (Li et al., 2020), can be understood as special cases of our framework.

Proposition 2.
All three frameworks (semantic loss, deepproblog, and neural-grammar-symbolic learning) share the same loss function

$$l(\theta) := -\sum_{i=1}^{N}\sum_{z_i \in S_{y_i}} \log P_\theta(z_i \mid x_i),$$

and they are equivalent to Problem (P) with a fixed $\gamma$ ($\gamma = 1$). Here, the equivalence means that the problems have the same optimal solution and gradient descent dynamics.

Remarks. The proof is in Appendix A.2. Compared with minimizing $l$, our framework enjoys two advantages: (i) the Boltzmann distribution $Q$ is explicitly expressed, making the sampling tractable and easy to implement even when the state space is very large; (ii) the annealing strategy of $\gamma$ largely alleviates the sensitivity to the initial point, guiding the training to a better solution.

Annealing strategy. Next, we discuss the decreasing strategy of $\gamma$. By setting $\phi^*_z = -\log P_\theta(z \mid x)$ as the entropy for each state $z$ (Thomas & Joy, 2006), we obtain

$$Q_{\phi^*}(z) = \frac{\exp(-\phi^*_z/\gamma)}{\sum_{z' \in S_y} \exp(-\phi^*_{z'}/\gamma)}.$$

It should be noted that the entropy $\phi^*_z$ is essentially the energy of the state $z$, and the coefficient $\gamma$ plays the role of temperature in the Boltzmann distribution (LeCun et al., 2006). From this perspective, it is natural to use classic annealing (or cooling) schedules to decrease $\gamma$ (Hajek, 1988; Nourani & Andresen, 1998; Henderson et al., 2003). In this work, we consider the following three schedules: (1) logarithmic cooling schedule, i.e., $\gamma_t = \gamma_0/\log(1 + t)$; (2) exponential cooling schedule, i.e., $\gamma_t = \gamma_0 \alpha^t$; (3) linear cooling schedule, i.e., $\gamma_t = \gamma_0 - \alpha t$.

Furthermore, after the annealing stage, i.e., when the temperature has decreased to a small value, we directly set $\gamma = 0$, and train the network for a few more epochs. Note that in this zero-degree stage, the problem is essentially reduced to a semi-supervised setting (Lee et al., 2013). That is, $Q_{\phi^*}$ shrinks to a one-hot categorical distribution when $\gamma = 0$, contributing some (pseudo) labels for the learning process.
Some semi-supervised techniques could be applicable to this case, but are not sufficiently efficient due to the massive state space. Therefore, we use the simplest strategy, i.e., we train only on those examples whose predicted symbols satisfy the symbolic constraint.
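The three cooling schedules and the effect of the temperature on the Boltzmann distribution can be illustrated with a short sketch. The function names are hypothetical, clipping the linear schedule at zero is an assumption made here (the text only states the linear form), and `boltzmann` expects a strictly positive $\gamma$ and a dictionary of strictly positive network probabilities for the feasible states.

```python
import math


def log_schedule(gamma0, t):
    """Logarithmic cooling: gamma_t = gamma0 / log(1 + t), for t >= 1."""
    return gamma0 / math.log(1 + t)


def exp_schedule(gamma0, alpha, t):
    """Exponential cooling: gamma_t = gamma0 * alpha**t, with 0 < alpha < 1."""
    return gamma0 * alpha ** t


def linear_schedule(gamma0, alpha, t):
    """Linear cooling: gamma_t = gamma0 - alpha * t.

    Clipping at zero is an assumption of this sketch.
    """
    return max(0.0, gamma0 - alpha * t)


def boltzmann(p_feasible, gamma):
    """Q(z) proportional to exp(-phi_z / gamma), with energy phi_z = -log P(z|x).

    `p_feasible` maps each feasible state to its probability under the network.
    """
    energy = {z: -math.log(p) for z, p in p_feasible.items()}
    e_min = min(energy.values())  # shift energies for numerical stability
    weights = {z: math.exp(-(e - e_min) / gamma) for z, e in energy.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}
```

With two feasible states of network probability 0.3 and 0.15, `boltzmann` assigns them 2/3 and 1/3 at $\gamma = 1$, moves towards the uniform distribution as $\gamma$ grows, and concentrates almost all mass on the more likely state as $\gamma$ approaches zero, matching the two extreme cases discussed in Section 2.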

5. EXPERIMENTS AND RESULTS

We carry out experiments on three tasks, viz. handwritten formula evaluation (HWF), visual Sudoku classification (Sudoku), and single-destination shortest path prediction in weighted graphs (SDSP). We split the proposed approach into two stages, i.e., Stage I: annealing ($\gamma$-decreasing) stage, and Stage II: zero-degree ($\gamma = 0$) stage, and separately evaluate Stage I and Stage I+II. For the first stage, we employ three different cooling schedules (Log, Exp, and Linear) as discussed in Section 4. The projection operator is specific to each task, and the corresponding inverse projection operator is implemented by the Z3 SMT solver (Moura & Bjørner, 2008). Through parallel computation (Joblib Development Team, 2020), Z3 solves the inverse projection in 2.8 ms to 6.4 ms per example on average, which is generally acceptable for batch gradient descent.

We compare our approach with the existing state-of-the-art methods, which can be divided into two categories, viz. policy-gradient-based approaches and symbolic-parser-based approaches. The former includes RL (i.e., learning with REINFORCE) and MAPO (Liang et al., 2018) (i.e., learning by Memory Augmented Policy Optimization). For the latter, most existing methods (e.g., semantic loss (Xu et al., 2018) and deepproblog (Manhaeve et al., 2018)) are intractable in the studied tasks (Huang et al., 2021). Hence, based on Proposition 2 and borrowing our projection-based MCMC technique, we implement a stochastic version of them (referred to as SSL henceforth) for comparison. More implementation details can be found in Appendix B. The code is available at https://github.com/SoftWiser-group/Soften-NeSy-learning.

5.1. HANDWRITTEN FORMULA EVALUATION

We first evaluate our approach on the handwritten formula dataset provided by Li et al. (2020). Since the original dataset consists of formulas with lengths varying from 1 to 7, which may lead to the label leakage problem, we only take the 6K/1.2K formulas with length 7 as the training/test set. In this neuro-symbolic system, the neural network is required to recognize symbols including digits 1-9 and basic operators (+, -, ×, ÷). The symbolic module evaluates the expression via the Python function 'eval'. In this task, we also compare with the neural-grammar-symbolic (NGS) method (Li et al., 2020), and a special case of our approach with a no-annealing strategy (NA) where we fix $\gamma = 0.001$. For SSL, NA, and our approach, we define the projection operator as $\Pi(z_1; \dots; z_7) = [z_1; z_2; z_4; z_6; z_7]$, i.e., we drop the third and fifth symbols in the formula.³

We report the symbol accuracy (i.e., the percentage of symbols that are correctly predicted) and the calculation accuracy (i.e., the percentage of final results that are correctly calculated) in Table 1. Observe that our approaches (Log, Exp, and Linear) significantly outperform the competitors. Additionally, when Stage II is included, both symbol accuracy and (especially) calculation accuracy can be further improved. Overall, our two-stage algorithm with the Exp annealing strategy achieves the best performance on both symbol accuracy and calculation accuracy. The Log annealing strategy in Stage II cannot obtain a result comparable with the other two strategies, because its temperature is not reduced to a sufficiently low value. Additional learning curve results and analysis can be found in Appendix B.4.
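The two metrics reported for this task can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, formulas are assumed to be strings over digits and the ASCII operators `+`, `-`, `*`, `/` (standing in for ×, ÷), and a prediction that fails to parse is counted as a wrong calculation.

```python
def symbol_accuracy(predicted, gold):
    """Fraction of symbol positions where the prediction matches the gold label."""
    correct = sum(p == g
                  for pred_f, gold_f in zip(predicted, gold)
                  for p, g in zip(pred_f, gold_f))
    total = sum(len(gold_f) for gold_f in gold)
    return correct / total


def calculation_accuracy(predicted, targets, tol=1e-9):
    """Fraction of predicted formulas that evaluate to the target output."""
    hits = 0
    for formula, y in zip(predicted, targets):
        try:
            # `eval` is only applied to strings over the fixed symbol alphabet.
            hits += abs(eval(formula) - y) < tol
        except (SyntaxError, ZeroDivisionError):
            pass  # a malformed prediction counts as a wrong calculation
    return hits / len(targets)


# A prediction can satisfy the constraint without matching the gold symbols.
predicted = ["4*9+3+3", "1+2+3+4"]
gold = ["4*8+3+7", "1+2+3+4"]
targets = [42, 10]
```

On this toy data the calculation accuracy is 1.0 while the symbol accuracy is only 12/14, which illustrates why a feasible grounding is not necessarily the correct one and why the two metrics are reported separately.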

5.2. VISUAL SUDOKU CLASSIFICATION

We next evaluate our approach on a visual Sudoku classification task (Wang et al., 2019), where the neural network recognizes the digits (i.e., MNIST images) in the Sudoku board, and the symbolic module determines whether a solution is valid for the puzzle. To evaluate the sample efficiency of our approach, we vary the size of the training set among 50, 100, 300, and 500, and the size of the test set is fixed at 1,000. Note that the solution space in this task is intrinsically connected. For example, one can easily obtain a new solution by permuting any two digits. Therefore, we additionally include this strategy without the projection (denoted by MCMC) as a baseline.

For a given 4-by-4 Sudoku puzzle, we divide it into four disjoint 2-by-2 subboards, and the projection drops the two anti-diagonal ones. In the projection space, we randomly switch two digits in different rows or columns, and the following example illustrates the whole projection process (dropped cells are shown as "·").

    2 4 3 1                 2 4 · ·                  2 4 · ·                         2 4 1 3
    3 1 4 2   Projection    3 1 · ·   Random walk    3 1 · ·   Inverse projection    3 1 4 2
    4 2 1 3   --------->    · · 1 3   ---------->    · · 3 1   ----------------->    4 2 3 1
    1 3 2 4                 · · 2 4                  · · 2 4                         1 3 2 4

Table 2 shows the accuracy result, i.e., the percentage of correctly predicted boards. It can be seen that RL, MAPO, and SSL fail to obtain a sensible result across all cases. Although the crude MCMC method can achieve a good result, it is still significantly outperformed by our approaches. The reason is that the Markov chain obtained in the original space mixes slowly, making the MCMC algorithm prone to getting stuck at local minima.

To investigate the grounding effect and the sample efficiency, we also report the number of training examples that are correctly grounded (i.e., satisfying the symbolic constraints) in the brackets of Table 2. The result shows the high sample efficiency of our approach. In particular, as the number of training puzzles increases, the rate of correctly grounded examples exceeds 90%.
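The projection, random walk, and inverse projection steps of the Sudoku example can be reproduced with the sketch below. It is a simplified illustration: the inverse projection enumerates all 4⁸ completions by brute force instead of calling Z3, and the function names (`project`, `swap`, `inverse_project`) are hypothetical.

```python
from itertools import product

SYMBOLS = (1, 2, 3, 4)


def is_valid(board):
    """Check that every row, column, and 2-by-2 subboard holds 1..4 once."""
    rows = [list(r) for r in board]
    cols = [[board[r][c] for r in range(4)] for c in range(4)]
    blocks = [[board[r + dr][c + dc] for dr in (0, 1) for dc in (0, 1)]
              for r in (0, 2) for c in (0, 2)]
    return all(sorted(g) == [1, 2, 3, 4] for g in rows + cols + blocks)


def project(board):
    """Keep the top-left and bottom-right subboards; dropped cells become None."""
    return [[board[r][c] if (r < 2) == (c < 2) else None for c in range(4)]
            for r in range(4)]


def swap(partial, a, b):
    """Random-walk move in the projected space: switch the digits at cells a and b."""
    out = [row[:] for row in partial]
    (ra, ca), (rb, cb) = a, b
    out[ra][ca], out[rb][cb] = out[rb][cb], out[ra][ca]
    return out


def inverse_project(partial):
    """Brute-force stand-in for the SMT call: all valid completions of `partial`."""
    holes = [(r, c) for r in range(4) for c in range(4) if partial[r][c] is None]
    solutions = []
    for values in product(SYMBOLS, repeat=len(holes)):
        board = [row[:] for row in partial]
        for (r, c), v in zip(holes, values):
            board[r][c] = v
        if is_valid(board):
            solutions.append(board)
    return solutions


board = [[2, 4, 3, 1], [3, 1, 4, 2], [4, 2, 1, 3], [1, 3, 2, 4]]
walked = swap(project(board), (2, 2), (2, 3))   # switch the 1 and 3 in the third row
```

Calling `inverse_project(walked)` recovers the final board of the example above. An empty result would mean that the proposed move has no feasible pre-image and should be rejected.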

5.3. SHORTEST PATH SEARCH

We finally conduct a single-destination shortest path search task. In this neuro-symbolic system, the symbolic reasoning part implements an A* search algorithm (Russell, 2010), which maintains a priority queue of the estimated distance $d(n) = g(n) + f_\theta(n)$, where $n$ is the next node on the path, $g(n)$ is the known distance from the start node to $n$, and $f_\theta(n)$ is the shortest distance from $n$ to the destination heuristically predicted by a neural network. For simplicity, we set the queue length to 1, i.e., we only visit the node with the shortest estimated distance. We randomly generate 3K/1K graphs as the training/test set through NetworkX (Hagberg et al., 2008). In each graph, the number of vertices is fixed to 30, and the weights of edges are uniformly sampled from {1, 2, ..., 9}.

For this regression task, we define the symbol $z$ as a multivariate Gaussian with diagonal covariance, i.e., $z \sim \mathcal{N}(f_\theta(x), \sigma^2 I)$, where $f_\theta(x)$ indicates the predicted distances from all nodes to the given destination. The dimension of $z$ is 30, and the projection is defined by dropping $[z_5, z_{10}, z_{15}, z_{20}, z_{25}]$. The random walk in each step selects a component of $z$ and adds uniform noise from $[-5, 5]$ to it.

Figure 3 shows the accuracy results, i.e., the percentage of shortest paths that are correctly obtained. To better understand the effectiveness, we additionally train a reference model (denoted as SUP) with directly supervised labels (i.e., the actual distance from each node to the destination). It can be observed that our approaches not only outperform the existing competitors, but also achieve a result comparable with the directly supervised model SUP.
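The search procedure with queue length 1 can be read as a greedy walk that always moves to the neighbor minimizing $g(n) + f(n)$. The sketch below follows that reading, which is an interpretation rather than the authors' code: the heuristic is passed as a plain dictionary `f` instead of a neural network, graphs are undirected adjacency dictionaries rather than NetworkX objects, and the function names are hypothetical.

```python
import heapq


def distances_to(graph, dest):
    """Dijkstra from the destination: true shortest distance of every node."""
    dist = {dest: 0}
    heap = [(0, dest)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def greedy_path(graph, start, dest, f):
    """Search with queue length 1: always move to the neighbor n that
    minimizes (edge weight to n) + f[n]; g(current) is common to all options."""
    path, cost, node = [start], 0, start
    for _ in range(len(graph)):  # a simple path visits each node at most once
        if node == dest:
            return path, cost
        nxt = min(graph[node], key=lambda n: graph[node][n] + f[n])
        cost += graph[node][nxt]
        node = nxt
        path.append(node)
    return (path, cost) if node == dest else (None, float("inf"))


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 7},
    "C": {"A": 4, "B": 2, "D": 3},
    "D": {"B": 7, "C": 3},
}
exact = distances_to(graph, "D")   # plays the role of a perfectly trained heuristic
```

With the exact distances as heuristic, the greedy walk from A follows A, B, C, D with total cost 6, which is the shortest path. With an uninformative all-zero heuristic, the same walk oscillates between A and B and never reaches D, which shows how strongly the result depends on the quality of the predicted distances.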

6. RELATED WORK

Neuro-symbolic learning. To build a robust computational model integrating concept acquisition and manipulation (Valiant, 2003), neuro-symbolic computing provides an attractive way to reconcile neural learning with logical reasoning. Numerous studies have focused on symbol grounding to enable conceptualization for neural networks. An in-depth introduction can be found in recent surveys (Marra et al., 2021; Garcez et al., 2022). According to the way the symbolic reasoning component is handled, we categorize the existing work as follows.

Learning with logical constraints. Methods in the first category parse the symbolic reasoning into an explicit logical constraint, and then translate the logical constraint into a differentiable loss function, which is incorporated as a constraint or regularization in network training.

Learning from symbolic reasoning. Another way is to regard the network's output as a predicate, and then maximize the likelihood of correct symbolic reasoning (learning from entailment (Raedt et al., 2016, Sec. 7)). In some of these methods (Manhaeve et al., 2018; Yang et al., 2020; Pryor et al., 2022; Winters et al., 2022), the symbol grounding is conducted in an implicit manner (as shown by Proposition 2), which limits the efficacy of network learning. Some other methods (Li et al., 2020; Dai et al., 2019) achieve an explicit symbol grounding in an abductive way, but still depend heavily on a good initial model. Our proposed method falls into this category, but it not only explicitly models the symbol grounding, but also alleviates the sensitivity to the initial model by enabling the interaction between neural learning and symbolic reasoning.

Differentiable logical reasoning. This line of work considers emulating symbolic reasoning through a differentiable component, and embedding it into complex network architectures.
To achieve this goal, a series of techniques (Trask et al., 2018; Grover et al., 2019; Wang et al., 2019; Chen et al., 2021) are proposed to approximate different modules in logical reasoning. Despite the success, these methods still succumb to the symbol grounding problem, and cannot achieve a satisfactory performance without explicit supervision (Topan et al., 2021).

Constrained counting and sampling. Quite a few neuro-symbolic learning methods (Manhaeve et al., 2018; Xu et al., 2018) rely on knowledge compilation (Darwiche & Marquis, 2002), which implements exact constrained counting based on Binary Decision Diagrams (BDD) (Akers, 1978) or Sentential Decision Diagrams (SDD) (Darwiche, 2011). Some approximate versions (De Raedt et al., 2007; Manhaeve et al., 2021) are proposed to overcome the computational hardness (Valiant, 1979; Jerrum et al., 1986), but are still inefficient and scale poorly to large problems. Aided by the progress of SAT/SMT solving (Malik & Zhang, 2009; Vardi, 2014), randomized approximate constrained counting/sampling approaches have been proposed, which are based on hashing (e.g., Chakraborty et al. (2013); Meel et al. (2016)) or MCMC (e.g., Wei et al. (2004); Gomes et al. (2006); Ermon et al. (2012)). In particular, previous MCMC-based methods suffer from connectivity barriers. Our approach is based on MCMC, but leverages projection to overcome the connectivity barrier (Moitra, 2019; Feng et al., 2021a). Moreover, our theoretical result shows that stochastic gradient descent can offset the possible bias of the gradient estimate introduced by MCMC and SMT solvers.

7. CONCLUSION

In this paper, we present a new neuro-symbolic learning framework for better integrating neural network learning and symbolic reasoning. To focus on the crucial problem of symbol grounding, we limit this work to scenarios where the symbolic reasoning logic is given as a priori knowledge. The next step is to incorporate the learning of the knowledge into our framework by, e.g., inductive logic programming. Moreover, though SMT solvers make the projection feasible in a broad range of settings, they may become a bottleneck when instantiating our framework for more complex systems. It would be interesting to consider a substitute for SMT solvers in neuro-symbolic frameworks.



¹ Current SMT solvers are mainly designed for the satisfaction problem, namely, they are efficient in finding a solution, but underperform in generating all solutions.

² To maintain stability, the single-site Metropolis conducts the random walk in a component-wise way. That is, in each iteration, it randomly selects and updates an individual component in the current state $z$ to generate a new state $z'$. However, it can be observed that the update of any individual component (i.e., '4', '×', '9', '+', '3', '+', '3') will result in no feasible new states.

³ We observe that the solution space is well-connected through different projections. For example, for the used projection with initial $\gamma_0 = 1.0$, around 46% of the solutions successfully jump to other solutions in an epoch.



Figure 1: An example neuro-symbolic system for handwritten formula evaluation. It takes a handwritten arithmetic expression $x$ as input and evaluates the expression to output $y$. The neural network component $M_\theta$ recognizes the symbols $z$ (i.e., digits and operators) in the expression, and the symbolic component evaluates the recognized formula by, e.g., the Python function 'eval'. The challenge in training $M_\theta$ comes from the lack of explicit $z$ to bridge the gap between the neural world ($x$ to $z$) and the symbol world ($z$ to $y$). Through softened symbol grounding, the model training and the constraint satisfaction join forces to resolve the latent $z$ to fit both the given $x$ and $y$.

[Figure 2 diagram: z = [4;×;9;+;3;+;3] → (projection) → [×;9;+;3;+] → (random walk in low-dimension space Ω) → [×;8;+;3;+] → (inverse projection via SMT) → z′ = [4;×;8;+;3;+;7]. The states z and z′ are unconnected solutions in the high-dimension space 𝒵.]

Figure 2: Sampling unconnected solutions via projection. For our running example, we use the projection $\Pi([z_1; \dots; z_7]) = [z_2; \dots; z_6]$, i.e., dropping the first and the last digits. The current state $z$ = [4; ×; 9; +; 3; +; 3] is projected to $\Pi(z)$ = [×; 9; +; 3; +]. We then randomly select an individual component, say, '9', and update it to '8'. Next, a new feasible state (e.g., 4×8+3+7=42) is derived by computing the inverse projection of $\Pi(z')$ = [×; 8; +; 3; +] with an SMT solver.

Figure 3: Accuracy (%) of the SDSP task. Our methods are better than competitors and close to the direct supervision case.

1. In a nutshell, we first conduct a few random walk steps to sample a new state z from the distribution Q_ϕ*; we then estimate the gradient based on z and conduct one stochastic gradient descent step. As shown by Feng et al. (2021a), under some proper assumptions, the Metropolis algorithm enjoys the rapid mixing property on the projection space (Levin & Peres, 2017). Therefore, we can efficiently construct an approximate sampler for Q_ϕ* without taking too many steps in the Metropolis algorithm. Additionally, both the sampling method and the SMT solver can be parallelized across examples, hence batch gradient descent is well supported in our framework.

Train the network
Estimate the gradient ∇ℓ(θ) = -∇_θ log P_θ(z|x).
Update the network parameters θ by stochastic gradient descent.
Decrease the coefficient γ.
end for
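The loop described above (a few random walk steps, one gradient step, then a smaller γ) can be sketched on a toy problem. The sketch below is not the paper's implementation: the "network" is a categorical distribution with free logits for a single input, the proposal is uniform over a given feasible set instead of the projection-based move, the target is assumed to be proportional to P_θ(z|x)^(1/γ) on the feasible set, and the linear γ schedule and all names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def metropolis_step(z, feasible, probs, gamma, rng):
    """One Metropolis step targeting Q(z) proportional to P(z|x)^(1/gamma) on `feasible`."""
    proposal = rng.choice(feasible)  # symmetric proposal; stand-in for the projection-based move
    log_ratio = (math.log(probs[proposal]) - math.log(probs[z])) / gamma
    accept_prob = math.exp(min(0.0, log_ratio))
    return proposal if rng.random() < accept_prob else z


def train(num_states, feasible, iters=200, T=10, lr=0.5,
          gamma0=1.0, gamma_min=0.01, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * num_states       # logits of the toy categorical model
    z = rng.choice(feasible)         # start from a feasible state
    gamma = gamma0
    for it in range(iters):
        probs = softmax(theta)
        for _ in range(T):           # a few random-walk steps to sample z
            z = metropolis_step(z, feasible, probs, gamma, rng)
        # The gradient of -log P_theta(z|x) w.r.t. the logits is probs - onehot(z).
        for k in range(num_states):
            theta[k] -= lr * (probs[k] - (1.0 if k == z else 0.0))
        # Linear decay of gamma, floored at gamma_min (assumed schedule).
        gamma = max(gamma_min, gamma0 * (1.0 - (it + 1) / iters))
    return theta, z, gamma


theta, z, gamma = train(num_states=6, feasible=[0, 2, 5])
print(softmax(theta), z, gamma)
```

Because the sampler only ever proposes feasible states, every gradient step raises the logit of a feasible state, so probability mass moves toward the feasible set as training proceeds.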

) are intractable in the studied

Accuracy (%) of the Sudoku task. Our methods significantly outperform the competitors.

Although several methods, e.g., Hu et al. (2016); Xu et al. (2018); Nandwani et al. (2019); Fischer et al. (2019); Hoernle et al. (2022), have been proposed to deal with many different forms of logical constraints, most of them tend to avoid symbol grounding. In other words, they often only confine the network's behavior, but do not guide conceptualization for the network.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their insightful comments and suggestions. This work is supported by the National Natural Science Foundation of China (Grants #62025202, #62172199). T. Chen is also partially supported by Birkbeck BEI School Project (EFFECT) and an overseas grant of the State Key Laboratory of Novel Software Technology under Grant #KFKT2022A03. Xiaoxing Ma (xxm@nju.edu.cn) is the corresponding author.

A PROOFS

A.1 PROOF OF PROPOSITION 1

Proof. We define ℓ(θ) = E_{z∼Q*} log P_θ(z|x) as the objective function, and consider the bias of the gradient between the distributions Q and Q*, which is denoted by m(θ). Note that our sampling strategy ensures that only feasible states will be generated, and thus we have [...]. Hence, through the Cauchy-Schwarz inequality, we can obtain that [...], where S_{Q*} represents the support of Q* and n denotes its cardinality. Now, by applying Lemma 3 in Ajalloeian & Stich (2020), we can obtain that [...], where σ^2 is the bounded variance of the gradient estimate.

A.2 PROOF OF PROPOSITION 2

Proof. It can be observed that the loss function l(θ) := -log ∑_{z∈S_y} P_θ(z|x) is essentially the semantic loss (Xu et al., 2018, Def. 1), as well as the loss used in DeepProbLog (Raedt et al., 2016, Sec. 7) and NGS (Li et al., 2020, Eq. 7). Now, we consider the gradient ∇l(θ), which can be computed as [...]. Let us switch to Problem (P). By setting γ = 1, for any z ∈ S_y, we can compute Q_ϕ* by [...], and thus ∇_θ ℓ(θ, ϕ*) can be rewritten as [...]. For completeness, we also briefly prove that Q_ϕ* is the optimal solution of min_ϕ ℓ(θ, ϕ) with γ = 1. Specifically, the Lagrangian function of the lower-level problem is [...]. Computing its gradient with respect to Q_ϕ(z) and letting it vanish, [...] should hold for any z ∈ S_y. Therefore, we have [...]. Putting these two equalities together, we can obtain that [...], which completes the proof.
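The steps of this argument can be written out explicitly. The following is a sketch rather than the authors' exact display: it assumes that, for γ = 1, the lower-level objective of Problem (P) is the entropy-regularized form E_{z∼Q_ϕ}[-log P_θ(z|x)] - H(Q_ϕ) over distributions supported on S_y, and the multiplier λ is introduced here only for illustration.

```latex
% Gradient of the semantic-loss-style objective l(\theta)
\begin{align*}
\nabla_\theta\, l(\theta)
  &= -\nabla_\theta \log \sum_{z \in S_y} P_\theta(z \mid x)
   = -\sum_{z \in S_y}
      \frac{P_\theta(z \mid x)}{\sum_{z' \in S_y} P_\theta(z' \mid x)}\,
      \nabla_\theta \log P_\theta(z \mid x).
\end{align*}

% Lower-level problem with \gamma = 1 (assumed form) and its Lagrangian
\begin{align*}
\mathcal{L}(Q_\phi, \lambda)
  &= \sum_{z \in S_y} Q_\phi(z)\,\bigl[\log Q_\phi(z) - \log P_\theta(z \mid x)\bigr]
     + \lambda \Bigl(\sum_{z \in S_y} Q_\phi(z) - 1\Bigr), \\
\frac{\partial \mathcal{L}}{\partial Q_\phi(z)} = 0
  &\;\Longrightarrow\;
  \log Q_\phi(z) - \log P_\theta(z \mid x) + 1 + \lambda = 0
  \quad \text{for all } z \in S_y, \\
\sum_{z \in S_y} Q_\phi(z) = 1
  &\;\Longrightarrow\;
  e^{-(1+\lambda)} = \frac{1}{\sum_{z' \in S_y} P_\theta(z' \mid x)}, \\
Q_{\phi^*}(z)
  &= \frac{P_\theta(z \mid x)}{\sum_{z' \in S_y} P_\theta(z' \mid x)}.
\end{align*}

% Upper-level gradient with Q_{\phi^*} held fixed
\begin{align*}
\nabla_\theta\, \ell(\theta, \phi^*)
  &= -\sum_{z \in S_y} Q_{\phi^*}(z)\, \nabla_\theta \log P_\theta(z \mid x)
   = \nabla_\theta\, l(\theta).
\end{align*}
```

Under the stated assumption, the two gradients coincide term by term, which is the claim of the proposition.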

B EXPERIMENTS

B.1 TWO-STAGE ALGORITHM

In this subsection, we briefly discuss the proposed two-stage algorithm used in Section 5. Recall that the two stages are: Stage I, the annealing (γ-decreasing) stage, and Stage II, the zero-degree (γ = 0) stage.

Stage I faithfully implements Algorithm 1. During Stage I training, as the temperature γ decreases, Q_ϕ* gradually converges to a one-hot categorical distribution, which finally gives a deterministic input-symbol mapping (i.e., a pseudo label). Ideally, if the solution space could be properly enumerated, we could start Stage II in a fully supervised way, i.e., fine-tuning the network with these deterministic mappings. However, the solution space is discrete and grows exponentially, and thus it is intractable to determine the mapping for each input.

To this end, we conduct Stage II in a semi-supervised way. That is, when fine-tuning the network, we only use the deterministic mappings that can be easily determined, and drop the others. Specifically, for a given input, if the model's prediction satisfies the symbolic constraint, Q_ϕ* can be directly computed according to equation 5. Hence, we only use these inputs as the training data in Stage II, leading to a semi-supervised setup.
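The Stage II data selection described above can be sketched as a simple filter. This is a minimal illustration, assuming the symbolic constraint is "the predicted formula evaluates to y"; the function names and the lookup table standing in for the network's arg-max prediction are hypothetical.

```python
def evaluate(z):
    """Symbolic component: evaluate a formula given as a list of symbols."""
    try:
        return eval("".join(z).replace("×", "*"))
    except ZeroDivisionError:
        return None


def select_stage2_data(examples, predict):
    """Keep (x, z_hat) only when the model's own prediction satisfies the constraint."""
    selected = []
    for x, y in examples:
        z_hat = predict(x)
        if evaluate(z_hat) == y:
            selected.append((x, z_hat))   # z_hat serves as the pseudo label for x
    return selected


# Hypothetical stand-in for the network's arg-max prediction on three inputs.
predictions = {
    "img_a": ["4", "×", "9", "+", "3", "+", "3"],
    "img_b": ["4", "×", "8", "+", "3", "+", "7"],
    "img_c": ["1", "+", "1", "+", "1", "+", "1"],
}
examples = [("img_a", 42), ("img_b", 42), ("img_c", 7)]
stage2_data = select_stage2_data(examples, predictions.get)
print([x for x, _ in stage2_data])
```

Inputs whose prediction violates the constraint (here `img_c`) are dropped rather than searched over, which is what makes the setup semi-supervised.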

B.2 FRAMEWORK GUIDELINE

Two key elements in our framework are the annealing strategy and the projection operator. Hence, we briefly discuss how to set the temperature in the annealing strategy and how to choose the projection operator.

(1) The setting of temperature. Intuitively, a good initial temperature should ensure that new states are not rejected during the first few training epochs. Therefore, setting the initial temperature to a large value (e.g., γ_0 = 1 in our three tasks) is generally effective. For the hyperparameter setting in the annealing strategy, we recommend following that of Nourani & Andresen (1998).

(2) The selection of projection operator. The selection of variables to be dropped by the projection operator is critical in our framework. Feng et al. (2021a) propose to evaluate the quality of a projection operator via entropy, which hints at choosing the variables that lead to a smaller decrease in entropy. A more direct and practical guideline is to drop variables that are highly correlated with others, because these variables depend on others and thus have lower entropy.

(3) The setting of projection dimension. The dimension of the projection space Ω requires a trade-off: a larger dim(Ω) cannot effectively improve the connectivity of solutions, while a smaller dim(Ω) may introduce more bias from the SMT solver. A practical method to determine dim(Ω) is trial and error, i.e., gradually decreasing the dimension of the projection space until the connectivity of Ω is satisfactory. Furthermore, there are different ways to measure the connectivity of the solution space. In theory, one may count the number of connected components of the solution space, which is not very practical. In our experiments, since we carry out a random walk, we adopt the number of random walk steps needed for the transition from the initial solution to a target solution. For example, in the HWF task, dim(Ω) is set to 5, since we observe that the solutions are fully connected by dropping the third and fifth symbols.

We only plot the curves for some of the methods for brevity. Our approaches (Log and Linear) achieve the best symbol accuracy on the training set, and also generalize better to the test set.
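The component-counting measure of connectivity mentioned in B.2 is impractical for the real tasks, but it can be illustrated on a toy solution space that is small enough to enumerate. The sketch below is an illustration only: it assumes length-5 formulas (digit, operator, digit, operator, digit) over hypothetical alphabets, links two states when they differ in exactly one component, and compares the full space with the space obtained by dropping the first and last digits.

```python
from itertools import product

DIGITS = "123456789"          # assumed digit alphabet
OPS = ["+", "-", "×", "/"]    # assumed operator alphabet


def alphabet_of(symbol):
    return DIGITS if symbol in DIGITS else OPS


def count_components(states):
    """Number of connected components when edges join states differing in one component."""
    remaining = set(states)
    components = 0
    while remaining:
        components += 1
        stack = [remaining.pop()]
        while stack:                       # depth-first search over single-site moves
            s = stack.pop()
            for i, symbol in enumerate(s):
                for candidate in alphabet_of(symbol):
                    neighbour = s[:i] + (candidate,) + s[i + 1:]
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        stack.append(neighbour)
    return components


def solutions(y):
    """All length-5 formulas d o d o d that evaluate to y (toy solution space)."""
    found = []
    for z in product(DIGITS, OPS, DIGITS, OPS, DIGITS):
        if eval("".join(z).replace("×", "*")) == y:
            found.append(z)
    return found


full = solutions(39)
projected = {z[1:-1] for z in full}        # drop the first and the last digits
print(count_components(full), count_components(projected))
```

Since any two solutions that are adjacent in the full space have equal or adjacent projections, the projected space can never have more components than the full one; the printed pair shows how much the projection helps on this toy instance.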

B.3 EXPERIMENTAL SETTING

Model architecture. For the HWF and Sudoku tasks, we used the LeNet-5 architecture; for the SDSP task, we used a multilayer perceptron with 30×30 input neurons, one hidden layer with 128 neurons, and an output layer of 30 neurons.

Batch size and epoch. For all tasks, the batch size was set to 64. For all comparison methods and our Stage I algorithm, the number of epochs was fixed to 1,000. For our Stage II algorithm, the number of epochs was fixed to 30. We fixed T = 10 in Alg. 1, i.e., conducting ten random walk steps before each gradient descent step.

Gradient descent algorithm. For all comparison methods, we followed the learning algorithm settings in their respective GitHub repositories. To be specific, RL, MAPO, and SSL used the Adam algorithm with learning rate 5e-4. For our approaches, we used the SGD algorithm with learning rate 0.1 in Stage I, and the Adam algorithm with learning rate 1e-3 in Stage II.

Implementation. For the RL, MAPO, and NGS methods, we used the code provided by the NGS authors. For the SSL and NA methods, we implemented them with the same projection technique and random walk strategy as our approach. The temperature γ is fixed to 0.001 in the NA method.

B.4 ADDITIONAL RESULTS

For the HWF task, we plot the training/test curves of our Stage I algorithms (Log and Linear) and the comparison methods (MAPO, NGS, SSL, and NA) in Figure 4. For our approaches, the random walk steps are also counted within the iterations. First, it can be observed that the policy-gradient-based method (MAPO) cannot fit the training data well due to the issue of sparse reward. The NGS method quickly overfits the training set, but cannot improve the symbol accuracy or generalize to the test set. This result is not surprising because the back-search in NGS is too greedy and hence only works with a good initial model. SSL and NA can be treated as two different variants of our framework; they learn well during the first few epochs, but then collapse due to the lack of an effective grounding.

In Table 3, we further report results of additional experiments on the HWF task. We consider different combinations of our method with the existing methods. We first apply the Stage II algorithm to SSL and NA. However, such variants collapse since they cannot provide sufficient calculation accuracy, and finally converge to nearly zero calculation accuracy. We next apply the back-search in NGS after our Stage I algorithm, by initializing with our Stage I models. This variant achieves results comparable with those obtained using direct supervision. Note that the slight accuracy drop compared with the original NGS paper arises because we only evaluate the model on length-7 formulas. This result further verifies the effectiveness of our softened symbol grounding. However, the back-search in NGS lacks versatility in more complex settings and is not applicable to the other studied tasks.

Published as a conference paper at ICLR 2023

