LEARNING WITH LOGICAL CONSTRAINTS BUT WITHOUT SHORTCUT SATISFACTION

Abstract

Recent studies have explored the integration of logical knowledge into deep learning via encoding logical constraints as an additional loss function. However, existing approaches tend to vacuously satisfy logical constraints through shortcuts, failing to fully exploit the knowledge. In this paper, we present a new framework for learning with logical constraints. Specifically, we address the shortcut satisfaction issue by introducing dual variables for logical connectives, encoding how the constraint is satisfied. We further propose a variational framework where the encoded logical constraint is expressed as a distributional loss that is compatible with the model's original training loss. The theoretical analysis shows that the proposed approach bears salient properties, and the experimental evaluations demonstrate its superior performance in both model generalizability and constraint satisfaction.

1. INTRODUCTION

There has been renewed interest in equipping deep neural networks (DNNs) with symbolic knowledge such as logical constraints/formulas (Hu et al., 2016; Xu et al., 2018; Fischer et al., 2019; Nandwani et al., 2019; Li & Srikumar, 2019; Awasthi et al., 2020; Hoernle et al., 2021). Typically, existing work first translates the given logical constraint into a differentiable loss function, and then incorporates it as a penalty term in the original training loss of the DNN. The benefits of this integration are well-demonstrated: it not only improves performance, but also enhances interpretability by regulating the model's behavior to satisfy particular logical constraints. Despite the encouraging progress, existing approaches tend to suffer from the shortcut satisfaction problem, i.e., the model overfits to a particular (easy) satisfying assignment of the given logical constraint. However, not all satisfying assignments reflect the truth, and different inputs may require different assignments to satisfy the same constraint. An illustrative example is given in Figure 1. Essentially, the example considers a logical constraint P → Q, which holds when (P, Q) = (T, T), (F, F), or (F, T). However, it is observed that existing approaches tend to simply satisfy the constraint by assigning F to P for all inputs, even when the real meaning of the logical constraint is arguably (P, Q) = (T, T) for certain inputs (e.g., class '6' in the example). To escape the trap of shortcut satisfaction, we propose to consider how a logical constraint is satisfied by distinguishing between different satisfying assignments of the constraint for different inputs. The challenge here is the lack of direct supervision on how a constraint is satisfied beyond its truth value.
However, our insight is that, by addressing this "harder" problem, we can make more room for the conciliation between logical information and training data, and achieve better model performance and logic satisfaction at the same time. To this end, when translating a logical constraint into a loss function, we introduce a dual variable for each operand of the logical connectives in the conjunctive normal form (CNF) of the logical constraint. The dual variables, together with the softened truth values of the logical variables, provide a working interpretation of how the logical constraint is satisfied. Take the example in Figure 1: for the satisfaction of P → Q, we consider its CNF ¬P ∨ Q and introduce two variables τ_1 and τ_2 to indicate the weights of the softened truth values of ¬P and Q, respectively. The blue dashed lines in the right part of Figure 1 (the P-Satisfaction and Q-Satisfaction subfigures) indicate that the dual variables gradually converge to the intended weights for class '6'.

Figure 1: Consider a semi-supervised classification task of handwritten digit recognition. For illustration purposes, we remove the labels of training images in class '6', but introduce a logical rule P := (f(R(x)) = 9) → Q := (f(x) = 6) to predict '6', where R(x) stands for rotating the image x by 180°. The ideal satisfying assignment should be (P, Q) = (T, T) for class '6'. However, existing methods (e.g., DL2 (Fischer et al., 2019)) tend to vacuously satisfy the rule by discouraging the satisfaction of P for all inputs, including those actually in class '6'. In contrast, our approach successfully learns to satisfy Q when P holds for class '6', even achieving accuracy (98.8%) comparable to the fully supervised setting.
Based on the dual variables, we then convert logical conjunctions and disjunctions into convex combinations of individual loss functions, which not only improves training robustness, but also ensures monotonicity with respect to logical entailment, i.e., the smaller the loss, the higher the satisfaction. Note that most existing logic-to-loss translations do not enjoy this property but only ensure that the logical constraint is fully satisfied when the loss is zero; however, fully satisfying the logical constraint is virtually infeasible in practice, rendering the training towards constraint satisfaction unreliable. Another limitation of existing approaches lies in the incompatibility during joint training. That is, existing work mainly treats the translated logic loss as a penalty under a multi-objective learning framework, whose effectiveness strongly relies on the weight selection of each objective, and may suffer when the objectives compete (Kendall et al., 2018; Sener & Koltun, 2018). In contrast, we introduce an additional random variable for the logical constraint to indicate its satisfaction degree, and formulate it as a distributional loss that is compatible with the neural network's original training loss under a variational framework. We cast the joint optimization of prediction accuracy and constraint satisfaction as a game and propose a stochastic gradient descent ascent algorithm to solve it. Theoretical results show that the algorithm converges to a superset of local Nash equilibria, thus settling the incompatibility problem to a large extent.
In summary, this paper makes the following main contributions: 1) a new logic encoding method that translates logical constraints to loss functions, considering how the constraints are satisfied, in particular, to avoid shortcut satisfaction; 2) a variational framework that jointly and compatibly trains both the translated logic loss and the original training loss with theoretically guaranteed convergence; 3) extensive empirical evaluations on various tasks demonstrating the superior performance in both accuracy and constraint satisfaction, confirming the efficacy of the proposed approach.

2.1. LOGICAL CONSTRAINTS

For a given neural network, we denote the data point by (x, y) ∈ X × Y, and use w to represent the model parameters. We use the variable v to denote the model's behavior of interest, which is represented as a function f_w(x, y) parameterized by w. For instance, we can define p_w(y = 1 | x) as the (predictive) confidence of y = 1 given x. We require f_w(x, y) to be differentiable with respect to w. An atomic formula a is of the form v ▷◁ c, where ▷◁ ∈ {≤, <, ≥, >, =, ≠} and c is a constant. We express a logical constraint in the form of a logical formula, consisting of the usual conjunctions, disjunctions, and negations of atomic formulas. For instance, we can use p_w(y = 1 | x) ≥ 0.95 ∨ p_w(y = 1 | x) ≤ 0.05 to specify that the confidence should be either no less than 95% or no greater than 5%. Note that v = c can be written as v ≥ c ∧ v ≤ c, and v ≥ c as −v ≤ −c, so henceforth for simplicity, we only consider atomic formulas of the form v ≤ c. We use v̄ to indicate a state of v, which is a concrete instantiation of v under the current model parameters w and data point (x, y). We say a state v̄ satisfies a logical formula α, denoted by v̄ |= α, if α holds under v̄. For two logical formulas α and β, we say α |= β if any v̄ that satisfies α also satisfies β. Moreover, we write α ≡ β if α is logically equivalent to β, i.e., they entail each other.
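To make the notation concrete, the following is a minimal sketch (helper names and the tuple encoding are ours, not the paper's) of evaluating whether a state satisfies a formula built from atoms of the form v ≤ c:

```python
# A minimal sketch of Section 2.1's notation: formulas are nested tuples
# over atomic formulas of the form v <= c, evaluated on a concrete state.

def satisfies(state, formula):
    """Recursively evaluate ('atom', key, c), ('and', ...), ('or', ...),
    or ('not', f). `state` maps behavior names to concrete values."""
    kind = formula[0]
    if kind == 'atom':
        _, key, c = formula
        return state[key] <= c           # atomic formula v <= c
    if kind == 'and':
        return all(satisfies(state, f) for f in formula[1:])
    if kind == 'or':
        return any(satisfies(state, f) for f in formula[1:])
    if kind == 'not':
        return not satisfies(state, formula[1])
    raise ValueError(kind)

# The confidence constraint from the text: p >= 0.95 OR p <= 0.05, with
# p >= 0.95 rewritten as -p <= -0.95 so every atom has the form v <= c.
confident = ('or', ('atom', 'neg_p', -0.95), ('atom', 'p', 0.05))
```

For example, a state with p = 0.97 (so neg_p = −0.97) satisfies `confident`, while p = 0.5 does not.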

2.2. LOGICAL CONSTRAINT TRANSLATION

(A) Atomic formulas. For an atomic formula a := v ≤ c, our goal is to learn model parameters w* such that the current state v* = f_{w*}(x, y) satisfies the formula for the given data point (x, y). For this purpose, we define a cost function

S_a(v) := max(v − c, 0), (1)

and switch to minimizing it. It is not difficult to show that the atomic formula holds if and only if w* is an optimal solution of min_w S(w) = max(f_w(x, y) − c, 0) with S(w*) = 0. For example, for the predictive confidence constraint p_w(y = 1 | x) ≥ 0.95 introduced before, the corresponding cost function is max(0.95 − p_w(y = 1 | x), 0), which is exactly the distance between the current state v and the satisfying states of the atomic constraint. Such a translation not only allows finding model parameters w* efficiently by optimization, but also paves the way for encoding more complex logical constraints, as discussed in the following.

(B) Logical conjunction. For the logical conjunction l := v_1 ≤ c_1 ∧ v_2 ≤ c_2, where v_i = f_{w,i}(x, y) for i = 1, 2, we first substitute the atomic formulas by their corresponding cost functions as defined in equation 1, and rewrite the conjunction as (S(v_1) = 0) ∧ (S(v_2) = 0). The conjunction can then be equivalently converted into a maximization form, i.e., max(S(v_1), S(v_2)) = 0. Based on this conversion, we could follow existing work and define the cost function of the logical conjunction as S_∧(l) := max(S(v_1), S(v_2)). However, directly minimizing this cost function is less effective, as it cannot encode how the constraints are satisfied, and it is also inefficient, since only one of S(v_1) and S(v_2) is optimized in each iteration. Therefore, we introduce a dual variable τ_i for each atomic formula, and extend the construction to the general conjunction l := ∧_{i=1}^k v_i ≤ c_i as

S_∧(l) = max_{τ_1,...,τ_k} Σ_{i=1}^k τ_i S(v_i), s.t. τ_1, ..., τ_k ≥ 0, τ_1 + ⋯ + τ_k = 1. (2)

(C) Logical disjunction.
Similar to the conjunction, the logical disjunction l := ∨_{i=1}^k v_i ≤ c_i can be equivalently encoded into a minimization form. Hence, the corresponding cost function is

S_∨(l) = min_{τ_1,...,τ_k} Σ_{i=1}^k τ_i S(v_i), s.t. τ_1, ..., τ_k ≥ 0, τ_1 + ⋯ + τ_k = 1. (3)

(D) Clausal formulas. In general, for a logical formula α = ∧_{i∈I} ∨_{j∈J(i)} v_ij ≤ c_ij in conjunctive normal form (CNF), the corresponding cost function can be defined as

S_α(v) = max_μ min_ν Σ_{i∈I} Σ_{j∈J(i)} μ_i · ν_ij · max(v_ij − c_ij, 0), s.t. Σ_{i∈I} μ_i = 1, Σ_{j∈J(i)} ν_ij = 1, μ_i, ν_ij ≥ 0,

where μ_i (i ∈ I) and ν_ij (j ∈ J(i)) are the dual variables for the conjunction and the disjunctions, respectively. The proposed cost function establishes an equivalence between the logical formula and the optimization problem. This is summarized in the following theorem, whose proof is included in Appendix A.

Theorem 1. Given the logical formula α = ∧_{i∈I} ∨_{j∈J(i)} v_ij ≤ c_ij, if the dual variables {μ_i, i ∈ I} and {ν_ij, j ∈ J(i)} of S_α(v) converge to {μ*_i, i ∈ I} and {ν*_ij, j ∈ J(i)}, then the cost function of α can be computed as S_α(v) = max_{i∈I} min_{j∈J(i)} {S_ij := max(v_ij − c_ij, 0)}. Furthermore, a sufficient and necessary condition for v* = f_{w*}(x, y) |= α is that w* is an optimal solution of min_w S_α(w), with the optimal value S_α(w*) = 0.
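The converged max-min form of Theorem 1 can be sketched in a few lines (our own illustrative code, not the authors' implementation; at the dual optimum, μ is one-hot on the worst clause and ν is one-hot on the best-satisfied literal of each clause):

```python
# A sketch of the CNF cost of Theorem 1: for alpha = AND_i OR_j (v_ij <= c_ij),
# the converged cost is max over clauses of the min over literals of the
# atomic costs S_ij = max(v_ij - c_ij, 0).

def atomic_cost(v, c):
    return max(v - c, 0.0)

def cnf_cost(clauses):
    """clauses: list of clauses, each a list of (v, c) pairs (disjuncts).
    Returns max_i min_j S_ij; zero iff every clause has a satisfied literal."""
    per_clause = []
    for clause in clauses:
        costs = [atomic_cost(v, c) for v, c in clause]
        per_clause.append(min(costs))   # disjunction: nu one-hot on argmin
    return max(per_clause)              # conjunction: mu one-hot on argmax
```

For instance, a formula whose every clause contains a satisfied literal has cost 0, while a clause with all literals violated contributes its least-violated literal's cost.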

2.3. ADVANTAGES OF OUR TRANSLATION

Monotonicity. As shown in Theorem 1, the optimal w* ensures that the model's behavior v* = f_{w*}(x, y) satisfies the target logical constraint α. Unfortunately, w* usually cannot achieve the optimum, due to the coupling between the cost function S_α(v) and the original training loss, as well as the fact that S_α(v) is usually non-convex. This reveals the main rationale of our logical encoding, which we summarize in the following theorem, with the proof given in Appendix B.

Theorem 2. For two logical formulas α = ∧_{i∈I} ∨_{j∈J} a^(α)_ij and β = ∧_{i∈I} ∨_{j∈J} a^(β)_ij, we have α |= β if and only if S_α(v) ≥ S_β(v) holds for any state of v.

Remarks. Theorem 2 essentially states that, when the dual variables converge, we progressively achieve a higher satisfaction degree of the logical constraint as the cost function decreases. This is especially important in practice, since it is usually infeasible to fully satisfy the logical constraint.

Interpretability. The introduced dual variables control the satisfaction degree of each individual atomic formula, and gradually converge towards the best valuation. Specifically, the dual variables learn how the given logical constraint can be satisfied for each data point, which resembles the structural parameter used by Bengio et al. (2020) to disentangle causal mechanisms. Namely, the optimal dual variables τ* disentangle the satisfaction degree of the entire logical formula into individual atomic formulas, and reveal the contribution of each atomic constraint to the entire logical constraint in a discriminative way. Consider the cost function S_l(v) = min_{τ∈[0,1]} τ S(v_1) + (1 − τ) S(v_2) for l := a_1 ∨ a_2. The expectation E_v[τ*] estimates the probability p(a_1 → l | x), and a larger E_v[τ*] indicates a greater probability that the first atomic constraint is met.

Robustness improvement. The dual variables also improve the stability of numerical computations.
First, compared with some classic fuzzy logic operators (e.g., directly adopting the max and min operators to encode conjunction and disjunction (Zadeh, 1965; Elkan et al., 1994; Hájek, 2013)), the dual variables alleviate the sensitivity to the initial point. Second, compared with other commonly-used translation strategies (e.g., using addition and multiplication as in DL2 (Fischer et al., 2019)), the dual variables balance the magnitude of the cost function, and further avoid some bad stationary points. Concrete examples are included in Appendix C.
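The first point can be illustrated numerically (a toy construction of ours, not from the paper): the hard max routes the entire subgradient to a single atom per step, whereas the convex combination with a dual variable τ ∈ (0, 1) distributes it across atoms.

```python
# Toy illustration of gradient routing: hard max vs. convex combination.

def grads_hard_max(s1, s2):
    """Subgradient of max(s1, s2) w.r.t. (s1, s2): winner-take-all, so
    only one atomic cost receives gradient in each step."""
    return (1.0, 0.0) if s1 >= s2 else (0.0, 1.0)

def grads_convex(tau):
    """Gradient of tau*s1 + (1 - tau)*s2 w.r.t. (s1, s2): both atoms
    receive a share controlled by the dual variable."""
    return (tau, 1.0 - tau)
```

With τ away from {0, 1}, both atomic costs are optimized simultaneously, which is the source of the efficiency and robustness gain discussed above.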

3.1. DISTRIBUTIONAL LOSS FOR LOGICAL CONSTRAINTS

Existing work mainly adopts a multi-objective learning framework to integrate logical constraints into DNNs, which is sensitive to the weight selection of each individual loss. In this work, we propose a variational framework (Blei et al., 2017) to achieve better compatibility between the two loss functions without the need for weight selection. Generally speaking, the original training loss of a neural network is usually a distributional loss (e.g., cross entropy), which aims to construct a parametric distribution p_w(y|x) that is close to the target distribution p(y|x). To keep the compatibility, we extend the distributional loss to the logical constraints. More concretely, we define an additional m-dimensional random variable z for the target logical constraint α and let z = S_α(v), where v = f_w(x, y). Here, we use the vector z to indicate the combination of multiple variables (e.g., v_1, ..., v_m) in the logical constraint for training efficiency. Next, we frame the training as a distributional approximation between the parametric distribution and the target distribution over y and z, and choose the KL divergence as the distributional loss:

min_w Σ_{i=1}^N KL(p(y_i, z_i | x_i) ∥ p_w(y_i, z_i | x_i)). (5)

By Bayes' theorem, p(y, z | x) can be decomposed as p(y, z | x) = p(z | x, y) · p(y | x). Therefore, we can reformulate equation 5 as

min_w Σ_{i=1}^N KL(p(y_i | x_i) ∥ p_w(y_i | x_i)) + E_{y_i|x_i}[KL(p(z_i | x_i, y_i) ∥ p_w(z_i | x_i, y_i))]. (6)

The detailed derivation can be found in Appendix D. In equation 6, the first term is the original training loss, and the second term is the distributional loss of the logical constraint. The remaining question is how to model the conditional probability distribution of the random variable z. Note that z = 0 indicates that the target logical constraint α is satisfied. Thus, the target distribution of z can be defined as a Dirac delta distribution (Dirac, 1981, Sec. 15), i.e., p(z | x, y) is a zero-centered Gaussian with variance tending to zero (Morse & Feshbach, 1954, Chap. 4.8). Furthermore, considering the non-negativity of z, we form the Dirac delta distribution as the limit of a sequence of truncated normal distributions on [0, +∞):

p(z | x, y) = lim_{σ→0} TN(z; 0, σ²I) = lim_{σ→0} (2/σ)^m φ(z/σ),

where φ(·) is the probability density function of the standard multivariate normal distribution. For the parametric distribution of z, we model p_w(z | x, y) as the truncated normal distribution on [0, +∞) with mean μ = S_α(w) and covariance diag(δ²):

p_w(z | x, y) = φ((z − μ)/δ) / (√|diag(δ²)| · (1 − Φ(−μ/δ))),

where Φ(·) is the cumulative distribution function of the standard multivariate normal distribution, and μ/δ denotes the element-wise division of the vectors μ and δ. Hence, the final optimization problem of our learning framework is

min_{w,δ} Σ_{i=1}^N KL(p(y_i | x_i) ∥ p_w(y_i | x_i)) + log|diag(δ)| + (1/2)‖μ_i/δ‖² + log(1 − Φ(−μ_i/δ)),

where μ_i = S_α(v_i) and v_i = f_w(x_i, y_i). Different from probabilistic soft logic, which estimates a probability for each atomic formula, we establish the probability distribution TN(z; μ, diag(δ²)) of the entire target logical constraint. This allows us to directly determine the variance δ² of the entire logical constraint rather than analyzing the correlation between atomic constraints. Moreover, the minimization over w and δ can be viewed as a competitive game (Roughgarden, 2010; Schaefer & Anandkumar, 2019). Roughly speaking, the two players first cooperate to achieve both higher model accuracy and a higher degree of logical constraint satisfaction, and then compete to finally reach a (local) Nash equilibrium (even though it may not exist). A local Nash equilibrium means that the model accuracy and the logical constraint satisfaction cannot be improved at the same time, which shows compatibility in the sense of game theory.
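The logic term of the final objective can be sketched numerically as follows (helper names are ours; Φ is computed via the error function, and we assume the per-dimension form of the objective stated above):

```python
import math

# A numeric sketch of the logic term of the final objective in Sec. 3.1:
# for mu = S_alpha(v) >= 0 and scale delta,
#   log|diag(delta)| + 0.5 * ||mu/delta||^2 + log(1 - Phi(-mu/delta)).

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logic_loss(mu, delta):
    """mu, delta: per-dimension constraint costs and scales."""
    loss = 0.0
    for m, d in zip(mu, delta):
        r = m / d
        loss += math.log(d)              # log|diag(delta)| term
        loss += 0.5 * r * r              # squared scaled-cost term
        loss += math.log(1.0 - Phi(-r))  # truncation correction
    return loss
```

Note that the loss decreases as μ = S_α(v) shrinks toward 0, consistent with the monotonicity property of the translation.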

3.2. OPTIMIZATION PROCEDURE

Note that the dual variables τ_∧ and τ_∨ of the cost function μ_i = S_α(v_i) also need to be optimized during training. To solve the optimization problem, we utilize a stochastic gradient descent ascent (SGDA) algorithm (with a min-oracle) (Lin et al., 2020); the details of the training algorithm are summarized in Appendix F. Specifically, we update w and τ_∨ through gradient descent, update τ_∧ through gradient ascent, and update δ via its approximate min-oracle, i.e., via the upper bound KL(p(z|x, y) ∥ p_w(z|x, y)) ≤ log|diag(δ)| + (1/2)‖μ/δ‖² + const. Thus, the update of δ in each iteration is

δ² = (1/N) Σ_{i=1}^N μ_i = (1/N) Σ_{i=1}^N S_α(v_i).

The update of δ plays an important role in our algorithm, because the classic SGDA algorithm (i.e., direct alternating gradient descent on w and δ) may converge to limit cycles. Therefore, updating δ via its approximate min-oracle not only makes each iteration more efficient, but also ensures the convergence of our algorithm, as summarized in Theorem 3.

Theorem 3. Let υ(·) = min_δ L(·, δ), and assume L(·) is L-Lipschitz. Algorithm 1 with step size η_w = γ/√(T + 1) ensures that the output w of T iterations satisfies

E[‖∇e_{υ/2κ}(w)‖²] ≤ O((κγ²L² + Δ_0) / (γ√(T + 1))),

where e_υ(·) is the Moreau envelope of υ(·).

Remarks. Theorem 3 states that the proposed algorithm with a suitable step size converges to a point (w*, δ*) that lies in a superset of local Nash equilibria.
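The role of the min-oracle can be sketched on a toy scalar problem (our own simplification; `s_alpha` is a stand-in hinge-style constraint cost, not the paper's): δ² is set in closed form to the mean constraint cost each iteration, instead of taking alternating gradient steps on δ, which can cycle.

```python
# A toy sketch of the Sec. 3.2 update scheme: gradient descent on w,
# with delta set by its approximate min-oracle delta^2 = mean_i S_alpha(v_i).

def s_alpha(w, x):
    """Stand-in constraint cost of one example: zero iff w * x <= 1."""
    return max(w * x - 1.0, 0.0)

def sgda_min_oracle(data, w=2.0, lr=0.1, steps=100):
    delta2 = 1.0
    for _ in range(steps):
        # min-oracle for delta: delta^2 equals the mean constraint cost
        delta2 = max(sum(s_alpha(w, x) for x in data) / len(data), 1e-8)
        # subgradient of 0.5 * mean_i (S_alpha(v_i)/delta)^2 w.r.t. w
        g = sum((s_alpha(w, x) / delta2) * (x if w * x > 1.0 else 0.0)
                for x in data) / len(data)
        w -= lr * g
    return w, delta2
```

On a single example x = 1, the parameter w descends until the constraint cost vanishes, at which point δ² collapses toward zero, mirroring the target Dirac distribution.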

4. EXPERIMENTS AND RESULTS

We carry out experiments on four tasks: handwritten digit recognition, handwritten formula recognition, shortest distance prediction in a weighted graph, and CIFAR100 image classification. For each task, we train the model with the normal cross-entropy loss on the labeled data as the baseline, and compare our approach with PD (Nandwani et al., 2019) and DL2 (Fischer et al., 2019), which are the state-of-the-art approaches for incorporating logical constraints into trained models. For PD, there are two choices (Choice 1 and Choice 2, as named by the authors) to translate logical constraints into loss functions, denoted by PD_1 and PD_2, respectively. We also compare with SL (Xu et al., 2018) and DPL (Manhaeve et al., 2018) on the first task. These two methods are intractable on the other three tasks, as they both employ knowledge compilation (Darwiche & Marquis, 2002) to translate logical constraints, which involves an implicit enumeration of all satisfying assignments. Each reported experimental result is averaged over five repeated runs. For each atomic formula a := v ≤ c, we consider it satisfied if v ≤ c − tol, where tol is a predefined tolerance threshold used to relax strict inequalities. We set tol = 0.01 on the three classification tasks and tol = 1 on the regression task (shortest distance prediction). More setup details can be found in Appendix H. The code, together with the experimental data, is available at https://github.com/SoftWiser-group/NeSy-without-Shortcuts.

4.1. HANDWRITTEN DIGIT RECOGNITION

In the first experiment, we construct a semi-supervised classification task by removing the labels of '6' in the MNIST dataset (LeCun et al., 1989) during training. We then apply a logical rule to predict label '6' using the rotation relation between '6' and '9': f(R(x)) = 9 → f(x) = 6, where x is the input and R(x) denotes the result of rotating x by 180 degrees. We rewrite the above rule as the disjunction (f(R(x)) ≠ 9) ∨ (f(x) = 6). We train the LeNet model on the MNIST dataset, and further validate the transferability of the model on the USPS dataset (Hull, 1994). The classification accuracy and logical constraint satisfaction results for class '6' are shown in Table 1. The ¬P-Sat. and Q-Sat. columns in the table indicate the satisfaction degrees of f(R(x)) ≠ 9 and f(x) = 6, respectively. At first glance, it may seem strange that our approach significantly outperforms some alternatives on accuracy, yet achieves almost the same logic rule satisfaction. However, taking a closer look at how the logic rule is satisfied, we find that the alternatives are all prone to learning a shortcut satisfaction (i.e., f(R(x)) ≠ 9) for the logical constraint, and thus cannot further improve the accuracy. In contrast, our approach learns to satisfy f(x) = 6 when f(R(x)) = 9, which is what should be learned from the logical rule. On the USPS dataset, we additionally train a reference model with full labels, which achieves 82.1% accuracy. Observe that the domain shift decreases the accuracy of all methods, but our approach still obtains the highest accuracy (even comparable with the reference model) and constraint satisfaction. This is because our approach better learns the prior knowledge in the logical constraint and thus yields better transferability. An additional transfer learning experiment on the CIFAR10 dataset showing the transferability of our approach can be found in Appendix J.
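The rule above can be encoded as a disjunction cost in the sense of Proposition 2 (our own illustrative encoding; the 0.5 threshold softening the two truth values is a hypothetical choice):

```python
# A sketch of the rule (f(R(x)) != 9) OR (f(x) = 6) as a disjunction cost:
# at the dual optimum, the cost is the minimum of the two disjunct costs.

def rule_cost(p9_rotated, p6, thresh=0.5):
    s_not_p = max(p9_rotated - thresh, 0.0)  # cost of satisfying f(R(x)) != 9
    s_q = max(thresh - p6, 0.0)              # cost of satisfying f(x) = 6
    return min(s_not_p, s_q)
```

A shortcut learner drives `s_not_p` to zero for all inputs; the approach described here can instead drive `s_q` to zero on true '6' inputs, which is precisely what the dual variables make visible.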

4.2. HANDWRITTEN FORMULA RECOGNITION

We next evaluate our approach on the handwritten formula (HWF) dataset, which consists of 10K training formulas and 2K test formulas (Li et al., 2020). The task is to predict the formula from the raw image, and then calculate the final result based on the prediction. We adopt a simple grammar rule of mathematical formulas, i.e., the four basic operators (+, −, ×, ÷) cannot appear in adjacent positions. For example, "3 + × 9" is not a valid formula according to this grammar. Hence, given a formula with k symbols, the target logical constraint is formulated as l := ∧_{i=1}^{k−1} (a_{i1} ∨ a_{i2}), with a_{i1} := p_digits(x_i) + p_digits(x_{i+1}) = 2 and a_{i2} := p_ops(x_i) + p_ops(x_{i+1}) = 1, where p_digits(x) and p_ops(x) represent the predictive probabilities that x is a digit or an operator, respectively. This logical constraint states that, for any two adjacent symbols, either both are digits (encoded by a_{i1}), or exactly one of them is a basic operator (encoded by a_{i2}). To show the efficacy of our approach, we cast this task in a semi-supervised setting, in which only a small portion of labeled data is available, but the logical constraints on the unlabeled data can be used during training. The calculation accuracy and logical constraint satisfaction results are provided in Table 2. In the table, we consider two data splits, i.e., 2% labeled and 80% unlabeled data, and 5% labeled and 20% unlabeled data in the training set. Our approach achieves the highest accuracy and logical constraint satisfaction in both cases, demonstrating the effectiveness of the proposed approach.
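The adjacency constraint can be sketched with the converged max-min form of Theorem 1 (our simplified encoding: the equality atom of a_{i1} becomes max(2 − sum, 0) since the probabilities sum to at most 2, and that of a_{i2} becomes |sum − 1|):

```python
# A sketch of l = AND_{i=1}^{k-1} (a_i1 OR a_i2) for a k-symbol formula.
# p_digit[i]: predicted probability that symbol i is a digit;
# p_op[i]: predicted probability that symbol i is one of + - x / .

def adjacency_cost(p_digit, p_op):
    clause_costs = []
    for i in range(len(p_digit) - 1):
        s1 = max(2.0 - (p_digit[i] + p_digit[i + 1]), 0.0)  # both digits
        s2 = abs(p_op[i] + p_op[i + 1] - 1.0)               # exactly one op
        clause_costs.append(min(s1, s2))  # disjunction within a clause
    return max(clause_costs)              # conjunction across positions
```

For a confidently predicted "1+2" the cost is 0, whereas adjacent operators as in "+×" incur a strictly positive cost.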

4.3. SHORTEST DISTANCE PREDICTION

We next consider a regression task, i.e., predicting the length of the shortest path between two vertices in a weighted connected graph G = (V, E). Basic properties should be respected by the model's prediction f(·): for any v_i, v_j, v_k ∈ V, the prediction should obey (1) symmetry: f(v_i, v_j) = f(v_j, v_i), and (2) the triangle inequality: f(v_i, v_j) ≤ f(v_i, v_k) + f(v_k, v_j). We train an MLP that takes the adjacency matrix of the graph as input, and outputs the predicted shortest distances from the source node to all other nodes. In the experiment, the number of vertices is fixed to 15, and the edge weights are uniformly sampled from {1, 2, ..., 9}. The results are shown in Table 3. We do not include the results of PD_1, as it does not support this regression task (PD_1 only supports the case where the softened truth value belongs to [0, 1]). It is observed that our approach achieves the lowest MSE/MAE with the highest constraint satisfaction. We further investigated the reason for the performance gain, and found that models trained by the existing methods easily overfit to the training set (e.g., the training MSE/MAE nearly vanishes). In contrast, the model trained by our method is well regulated by the required logical constraints, making the test error more consistent with the training error.
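The two regression constraints translate into costs over all vertex pairs and triples (helper names are ours; `D[i][j]` stands for the model's predicted shortest distance between vertices i and j):

```python
# A sketch of the symmetry and triangle-inequality constraints as costs
# in the max-min style of Theorem 1: zero iff the property holds.

def symmetry_cost(D):
    n = len(D)
    return max(abs(D[i][j] - D[j][i]) for i in range(n) for j in range(n))

def triangle_cost(D):
    n = len(D)
    return max(max(D[i][j] - D[i][k] - D[k][j], 0.0)
               for i in range(n) for j in range(n) for k in range(n))
```

A valid shortest-distance matrix yields zero for both costs; a prediction that skips an intermediate vertex (e.g., D[0][2] = 5 while D[0][1] = D[1][2] = 1) is penalized by the triangle cost.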

4.4. IMAGE CLASSIFICATION

We finally evaluate our method on the classic CIFAR100 image classification task. The CIFAR100 dataset contains 100 classes, which can be further grouped into 20 superclasses (Krizhevsky et al., 2009). Hence, we use the logical constraint proposed by Fischer et al. (2019): if the model classifies an image into any class, the predictive probability of the corresponding superclass should receive the full probability mass. For example, the people superclass consists of five classes (baby, boy, girl, man, and woman), and thus p_people(x) should have 100% probability if the input x is classified as girl. We can formulate this logical constraint as ∧_{s∈superclasses} (p_s(x) = 0 ∨ p_s(x) = 1), where the probability of a superclass is the sum of its corresponding classes' probabilities (for example, p_people(x) = p_baby(x) + p_boy(x) + p_girl(x) + p_man(x) + p_woman(x)). We construct a labeled dataset of 10,000 examples and an unlabeled dataset of 30,000 examples by randomly sampling from the original training data, and then train three different models (VGG16, ResNet50, and DenseNet100) to evaluate the performance of different methods. The classification accuracy and constraint satisfaction results are shown in Figure 2 and Table 8, respectively. We observe significant improvements in both metrics for our approach.
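This superclass constraint can be sketched as a max-min cost (the equalities are encoded as distance costs; the `groups` mapping below is a tiny hypothetical stand-in for the actual CIFAR100 class-to-superclass hierarchy):

```python
# A sketch of AND_s (p_s = 0 OR p_s = 1), where p_s is the sum of the
# class probabilities belonging to superclass s.

def superclass_cost(class_probs, groups):
    clause_costs = []
    for classes in groups.values():
        p_s = sum(class_probs[c] for c in classes)
        clause_costs.append(min(p_s, abs(p_s - 1.0)))  # p_s = 0 OR p_s = 1
    return max(clause_costs)                           # AND over superclasses

groups = {'people': ['baby', 'boy'], 'trees': ['oak', 'pine']}
```

A prediction whose mass is concentrated in one superclass has cost 0, whereas mass split evenly across two superclasses is maximally penalized.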

5. RELATED WORK

Learning with constraints. Research on linking learning and logical reasoning has been emerging for years (Roth & Yih, 2004; Chang et al., 2008; Cropper & Dumančić, 2020). Early research mostly concentrated on mining well-generalized logical relations (Muggleton, 1992), i.e., inducing a set of logical (implication) rules based on given logical atoms and examples. Recently, several works have switched to training models with deterministic logical constraints that are explicitly provided (Kimmig et al., 2012; Giannini et al., 2019; Nandwani et al., 2019; van Krieken et al., 2022; Giunchiglia et al., 2022). They usually use an interpreter to soften the truth values, and utilize fuzzy logic (Wierman, 2016) to encode the logical connectives. However, such conversion is usually quite costly, as it necessitates an interpreter, which must be additionally constructed via the most probable explanation (Bach et al., 2017). Moreover, the precise logical meaning is lost in the conversion.

Neuro-symbolic learning. Our work is related to neuro-symbolic computation (Garcez et al., 2019; Marra et al., 2021). In this area, integrating learning into logic has been explored in several directions, including neural theorem proving (Rocktäschel & Riedel, 2017; Minervini et al., 2020), extending logic programs with neural predicates (Manhaeve et al., 2018; Yang et al., 2020), encoding algorithmic layers (e.g., satisfiability solvers) into DNNs (Wang et al., 2019; Chen et al., 2020), using neural models to approximate executable logical programs (Li & Srikumar, 2019; Badreddine et al., 2022; Ahmed et al., 2022), as well as incorporating neural networks into logic reasoning (Yang et al., 2017; Evans & Grefenstette, 2018; Dong et al., 2019).
Along this direction, there are other ways to integrate logical knowledge into learning, e.g., knowledge distillation (Hu et al., 2016), learning embeddings for logical rules (Xie et al., 2019), treating rules as noisy labels (Awasthi et al., 2020), and abductive reasoning (Dai et al., 2019; Zhou, 2019).

Multi-objective learning. Training a model with logical constraints is essentially a multi-objective learning task, for which two typical solutions exist (Marler & Arora, 2004): the ϵ-constraint method and the weighted-sum method. The former rewrites objective functions into constraints, i.e., solving min_x f(x), s.t. g(x) ≤ ϵ instead of the original problem min_x (f(x), g(x)). For example, Donti et al. (2021) directly solve the learning problem with (hard) logical constraints via constraint completion and correction. However, this method may not be efficient enough for deep learning tasks, considering the high computational cost of Hessian-vector products and the ill-conditioning of the problem. The latter method is more popular and minimizes a proxy objective, i.e., a weighted average min_x w_1 f(x) + w_2 g(x). Such a method strongly depends on the weights of the two terms, and may be highly ineffective when the two losses conflict (Kendall et al., 2018; Sener & Koltun, 2018).

6. CONCLUSION

In this paper, we have presented a new approach for better integrating logical constraints into deep neural networks. The proposed approach encodes logical constraints into a distributional loss that is compatible with the original training loss, guaranteeing monotonicity with respect to logical entailment, significantly improving interpretability and robustness, and largely avoiding shortcut satisfaction of the logical constraints. The proposed approach has been shown to improve both model generalizability and logical constraint satisfaction. A limitation of this work is that we set the target distribution of any logical formula to the Dirac distribution; further investigation is needed to determine when this setting is effective and whether an alternative could be better. Additionally, our approach relies on the quality of the manually specified logical formulas, and complementing it with automatic logic induction from raw data is an interesting future direction.

A PROOF OF THEOREM 1 A.1 TRANSLATION EQUIVALENCE OF LOGICAL CONJUNCTION

The following proposition shows the equivalence between our translation and the original expression of logical conjunction.

Proposition 1. Given $l := \bigwedge_{i=1}^{k} (v_i \le c_i)$, where $v_i = f_{w,i}(x, y)$ and $S_i := \max(v_i - c_i, 0)$ for $i = 1, \ldots, k$, if $\{\tau_i^*\}_{i=1}^{k}$ is the optimal solution of the cost function $S_\wedge(l)$, then $\sum_{i=1}^{k} \tau_i^* S(v_i) = \max(S(v_1), \ldots, S(v_k))$. Furthermore, $v^* = f_{w^*}(x, y) \models l$ if and only if $w^*$ is the optimal solution of $\min_w S_\wedge(w)$, with the optimal value $S_\wedge(w^*) = \max(S_1(w^*), \ldots, S_k(w^*)) = 0$.

Proof. For the minimization of the cost function
$$\min_w S_\wedge(w) := \max(S_1(w), \ldots, S_k(w)), \tag{8}$$
we introduce a slack variable $t$ and rewrite Eq. (8) as
$$\min_{w,t}\ t, \quad \text{s.t. } t \ge S_i(w),\ i = 1, \ldots, k. \tag{9}$$
The Lagrangian function of Eq. (9) is
$$L(w, t; \tau_1, \ldots, \tau_k) = t + \sum_{i=1}^{k} \tau_i (S_i(w) - t), \quad \tau_i \ge 0,\ i = 1, \ldots, k.$$
Letting the gradient with respect to $t$ vanish, we obtain the dual problem
$$\max_{\tau_1, \ldots, \tau_k} \min_w\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w), \quad \text{s.t. } \tau_1 + \cdots + \tau_k = 1,\ 0 \le \tau_i \le 1,\ i = 1, \ldots, k.$$
By the max-min inequality, we have
$$\max_{\tau_1, \ldots, \tau_k} \min_w\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w) \le \min_w \max_{\tau_1, \ldots, \tau_k}\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w).$$
Therefore, the cost function $S_\wedge(w)$ of the logical conjunction can be computed by introducing the dual variables $\tau_i$, $i = 1, \ldots, k$:
$$S_\wedge(w) = \max_{\substack{\tau_1, \ldots, \tau_k \in [0,1] \\ \tau_1 + \cdots + \tau_k = 1}}\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w).$$
The Karush-Kuhn-Tucker (KKT) conditions of Eq. (9) are
$$\tau_1 \nabla S_1(w) + \cdots + \tau_k \nabla S_k(w) = 0, \qquad \tau_1 + \cdots + \tau_k = 1,$$
$$t \ge S_i(w), \quad \tau_i \ge 0, \quad \tau_i (t - S_i(w)) = 0, \quad i = 1, \ldots, k.$$
Denote by $I$ the index set of the largest elements of $\{S_i\}_{i=1}^{k}$. Suppose the dual variables $\{\tau_i\}_{i=1}^{k}$ converge to $\{\tau_i^*\}_{i=1}^{k}$; then $\tau_j^* = 0$ for any $j \notin I$, and thus $\sum_{i \in I} \tau_i^* = 1$. Since $S_i(w) = \max(S_1(w), \ldots, S_k(w))$ for any $i \in I$, we have
$$S_\wedge(w) = \tau_1^* S_1(w) + \cdots + \tau_k^* S_k(w) = \sum_{i \in I} \tau_i^* S_i(w) = \max(S_1(w), \ldots, S_k(w)).$$
Furthermore, if $w^*$ is the optimal solution and $S_\wedge(w^*) = 0$, then $S_1(w^*) = \cdots = S_k(w^*) = 0$, which implies that $w^*$ ensures the satisfaction of the constraint $\bigwedge_{i=1}^{k} (v_i \le c_i)$. Conversely, if $v^* = f_{w^*}(x, y)$ entails the logical conjunction $\bigwedge_{i=1}^{k} (v_i \le c_i)$, then $S_i(w^*) = 0$ for $i = 1, \ldots, k$, and hence $w^*$ is the optimal solution of $\min_w S_\wedge(w)$ with $S_\wedge(w^*) = 0$. ∎
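As a numerical sanity check of Proposition 1 (our illustration, not the paper's code), maximizing the weighted sum $\tau_1 S_1 + \cdots + \tau_k S_k$ over the probability simplex indeed recovers $\max(S_1, \ldots, S_k)$: projected gradient ascent drives all the dual mass onto the largest cost. The simplex projection helper is the standard Euclidean projection, an implementation detail we assume.

```python
# Numeric sanity check (our illustration, not the paper's code): maximizing
# tau_1*S_1 + ... + tau_k*S_k over the probability simplex recovers
# max(S_1, ..., S_k), as stated in Proposition 1.

def project_simplex(v):
    """Euclidean projection onto {tau : tau_i >= 0, sum(tau) = 1}."""
    u = sorted(v, reverse=True)
    cssv, theta = 0.0, 0.0
    for i, ui in enumerate(u):
        cssv += ui
        t = (cssv - 1.0) / (i + 1)
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

S = [0.2, 1.5, 0.7]          # costs S_i(w) of the conjuncts at a fixed w
tau = [1.0 / 3] * 3          # uniform initialization of the dual variables
for _ in range(500):         # projected gradient *ascent* on the dual
    tau = project_simplex([t + 0.1 * s for t, s in zip(tau, S)])
value = sum(t * s for t, s in zip(tau, S))   # -> max(S) = 1.5
```

All the dual mass ends up on the largest cost $S_2 = 1.5$, mirroring the KKT argument above that $\tau_j^* = 0$ for $j \notin I$.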

A.2 TRANSLATION EQUIVALENCE OF LOGICAL DISJUNCTION

We also have the following proposition demonstrating the equivalence between the cost function $S_\vee(l)$ and the original expression of logical disjunction.

Proposition 2. Given $l := \bigvee_{i=1}^{k} (v_i \le c_i)$, where $v_i = f_{w,i}(x, y)$ and $S_i := \max(v_i - c_i, 0)$ for $i = 1, \ldots, k$, if $\{\tau_i^*\}_{i=1}^{k}$ is the optimal solution of the cost function $S_\vee(l)$, then $\sum_{i=1}^{k} \tau_i^* S(v_i) = \min(S(v_1), \ldots, S(v_k))$. Furthermore, $v^* = f_{w^*}(x, y) \models l$ if and only if $w^*$ is the optimal solution of $\min_w S_\vee(w)$, with the optimal value $S_\vee(w^*) = \min(S_1(w^*), \ldots, S_k(w^*)) = 0$.

Proof. For the minimization of the cost function
$$\min_w S_\vee(w) := \min(S_1(w), \ldots, S_k(w)), \tag{10}$$
we introduce a slack variable $t$ and rewrite Eq. (10) as
$$\min_w \max_t\ t, \quad \text{s.t. } t \le S_i(w),\ i = 1, \ldots, k. \tag{11}$$
The corresponding dual problem is
$$\min_{\tau_1, \ldots, \tau_k} \min_w\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w), \quad \text{s.t. } \tau_1 + \cdots + \tau_k = 1,\ 0 \le \tau_i \le 1,\ i = 1, \ldots, k.$$
Therefore, the cost function $S_\vee(w)$ of the logical disjunction can be computed by introducing the dual variables $\tau_i$, $i = 1, \ldots, k$:
$$S_\vee(w) = \min_{\substack{\tau_1, \ldots, \tau_k \in [0,1] \\ \tau_1 + \cdots + \tau_k = 1}}\ \tau_1 S_1(w) + \cdots + \tau_k S_k(w).$$
Similar to the proof of Proposition 1, using the KKT conditions of Eq. (11) and supposing the dual variables $\{\tau_i\}_{i=1}^{k}$ converge to $\{\tau_i^*\}_{i=1}^{k}$, we obtain
$$S_\vee(w) = \tau_1^* S_1(w) + \cdots + \tau_k^* S_k(w) = \sum_{i \in I} \tau_i^* S_i(w) = \min(S_1(w), \ldots, S_k(w)),$$
where $I$ is here the index set of the smallest elements of $\{S_i\}_{i=1}^{k}$. If $w^*$ is the optimal solution and $S_\vee(w^*) = 0$, then $S_i(w^*) = 0$ for some $i$, which implies that $w^*$ ensures the satisfaction of the constraint $\bigvee_{i=1}^{k} (v_i \le c_i)$. Conversely, if $w^*$ entails the logical disjunction $\bigvee_{i=1}^{k} (v_i \le c_i)$, then $S_i(w^*) = 0$ for some $i$, and hence $w^*$ is the optimal solution of $\min_w S_\vee(w)$ with $S_\vee(w^*) = 0$. ∎

A.3 PROOF OF THEOREM 1

The proof can be directly derived from Proposition 1 and Proposition 2.

B PROOF OF THEOREM 2

Proof. We first collect all the variables $v_1, \ldots, v_k$ involved in the logical constraint into a vector $v$; accordingly, the constants $c_i$, $i = 1, \ldots, k$, are extended to $\mathbb{R} \cup \{+\infty\}$, so that the constraints $\alpha$ and $\beta$ can be written as
$$\alpha := \bigwedge_{i \in I}\bigvee_{j \in J}\big(v \le c^{(\alpha)}_{ij}\big), \qquad \beta := \bigwedge_{i \in I}\bigvee_{j \in J}\big(v \le c^{(\beta)}_{ij}\big).$$
For the sufficient condition, suppose that $v^* \models \alpha$. Then $S_\alpha(v^*) = 0$ by Theorem 1. Since $S_\alpha(v) \ge S_\beta(v)$ holds for any $v \in V$ and the cost function is non-negative, we obtain $S_\beta(v^*) = 0$, which implies that $v^* \models \beta$.

For the necessary condition of Theorem 2, we introduce the following proposition.

Proposition 3. Given the underlying space $V$, for any point $v \in V$ and subset $C \subseteq V$, define the distance of $v$ from $C$ as $\mathrm{dist}(v, C) = \inf\{\mathrm{dist}(v, u) \mid u \in C\}$. Let $A$ and $B$ be two closed subsets of $V$ with $A$ non-empty. Then $A \subseteq B$ if and only if $\mathrm{dist}(v, A) \ge \mathrm{dist}(v, B)$ for all $v \in V$.

Therefore, letting $A$ and $B$ be the sets defined by $\alpha$ and $\beta$, i.e.,
$$A = \bigcap_{i \in I}\bigcup_{j \in J}\{v \mid v \le c^{(\alpha)}_{ij}\}, \qquad B = \bigcap_{i \in I}\bigcup_{j \in J}\{v \mid v \le c^{(\beta)}_{ij}\},$$
Proposition 3 gives that $\alpha \models \beta$ if and only if $\mathrm{dist}(v, A) \ge \mathrm{dist}(v, B)$ for any $v \in V$.

Given atomic formulas $a_1 := v \le c_1$ and $a_2 := v \le c_2$, the corresponding cost functions are $S_1(v) = \max(v - c_1, 0)$ and $S_2(v) = \max(v - c_2, 0)$, respectively. These cost functions are exactly the (Chebyshev) distances of $v$ to the sets $\{v \mid v \le c_1\}$ and $\{v \mid v \le c_2\}$, respectively. (Note that a vector constraint $v \le c$ decomposes into the conjunction of $v_i \le c_i$, $i = 1, \ldots, k$.) Hence, by Proposition 3, $a_1 \models a_2$ if and only if $S_{a_1}(v) \ge S_{a_2}(v)$ for any $v \in V$.

It remains to prove that the cost functions of logical conjunction and disjunction are still the distances of $v$ from the corresponding constraint sets; the results follow from a few calculations of linear algebra. A direct result can also be obtained using Martinón (2004, Theorem 1). ∎

C CONCRETE EXAMPLES IN ROBUSTNESS IMPROVEMENT

(1) Consider the logical constraint $(v^2 \le -1) \vee (3v \ge 2)$; the corresponding cost function based on the min and max operators is $S(v) = \min(v^2 + 1, \max(2 - 3v, 0))$. If we directly minimize $S(v)$ from the initial point $v_0 = 0$, we have $S(v_0) = v_0^2 + 1 = 1$ and $\nabla S(v_0) = 2v_0 = 0$. Hence, $v_0$ is already a stationary point of $\min_v S(v)$: the conventional optimization technique is ineffective here and cannot find a feasible solution entailing the disjunction even though one exists. In contrast, with the dual variable $\tau$, the minimization problem becomes $\min_{v, \tau}\ \tau(v^2 + 1) + (1 - \tau)\max(2 - 3v, 0)$. Starting from $v_0 = 0$ and $\tau_0 = 0.5$, one can easily obtain a feasible solution $v^* = 1.5$ via the coordinate descent algorithm.

(2) For the translation strategy used in DL2, the cost functions of the conjunction $a_1 \wedge a_2$ and the disjunction $a_1 \vee a_2$ are defined by $S_\wedge(v) = S(v_1) + S(v_2)$ and $S_\vee(v) = S(v_1) \cdot S(v_2)$, respectively. The conjunction translation is essentially a special case of our encoding strategy (i.e., $\tau_1$ and $\tau_2$ fixed to 0.5). For the disjunction, the multiplication may ruin the magnitude of the cost function, so that it is no longer a reasonable measure of constraint satisfaction. Moreover, this translation also brings additional numerical difficulties. For example, for the disjunction constraint $(v = 1) \vee (v = 2) \vee (v = 3)$, the product $S_\vee(v) = S(v_1) \cdot S(v_2) \cdot S(v_3)$ induces two more bad stationary points (near $v = 1.5$ and $v = 2.5$) compared with the dual form $\min_{v, \tau}\ \tau_1 S(v_1) + \tau_2 S(v_2) + \tau_3 S(v_3)$.
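The failure mode and the dual-variable escape in example (1) can be reproduced numerically. The following sketch is our illustration, not the paper's implementation; the step sizes and iteration counts are arbitrary choices.

```python
# Sketch of example (1) (our illustration; step sizes and iteration counts are
# arbitrary). Plain (sub)gradient descent on S(v) = min(v^2 + 1, max(2 - 3v, 0))
# stalls at v0 = 0, while coordinate descent on the dual form
#   L(v, tau) = tau*(v^2 + 1) + (1 - tau)*max(2 - 3v, 0)
# finds a point satisfying the disjunct 3v >= 2.

def quad(v):   # cost of the (unsatisfiable) disjunct v^2 <= -1
    return v * v + 1

def hinge(v):  # cost of the disjunct 3v >= 2
    return max(2 - 3 * v, 0.0)

# (a) Gradient descent on the min-based encoding: at v = 0 the active branch
# is quad (1 < 2), whose gradient 2v vanishes, so the iterate never moves.
v_plain = 0.0
for _ in range(100):
    if quad(v_plain) <= hinge(v_plain):
        g = 2 * v_plain
    else:
        g = -3.0 if 2 - 3 * v_plain > 0 else 0.0
    v_plain -= 0.1 * g
cost_plain = min(quad(v_plain), hinge(v_plain))   # stuck at cost 1.0

# (b) Coordinate descent on the dual form, from v = 0, tau = 0.5.
v, tau = 0.0, 0.5
for _ in range(5):                 # outer coordinate passes
    for _ in range(200):           # v-step: inner gradient descent at fixed tau
        g = tau * 2 * v + (1 - tau) * (-3.0 if 2 - 3 * v > 0 else 0.0)
        v -= 0.01 * g
    tau = 1.0 if quad(v) < hinge(v) else 0.0   # tau-step: exact minimization
cost_dual = tau * quad(v) + (1 - tau) * hinge(v)  # -> 0.0, with 3v >= 2
```

The dual weight moves to the satisfiable disjunct ($\tau = 0$), which is exactly the "how the constraint is satisfied" information that the min-based encoding discards.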

D THE COMPUTATION OF KL DIVERGENCE

For an input-output pair $(x, y)$ with the logical variable $z$, the KL divergence is
$$\mathrm{KL}(p(y, z \mid x) \,\|\, p_w(y, z \mid x)) = \sum_{y,z} p(y, z \mid x) \log \frac{p(y, z \mid x)}{p_w(y, z \mid x)} = \sum_{y,z} p(y, z \mid x) \left[ \log \frac{p(y \mid x)}{p_w(y \mid x)} + \log \frac{p(z \mid x, y)}{p_w(z \mid x, y)} \right].$$
For the first term on the right-hand side,
$$\sum_{y,z} p(y, z \mid x) \log \frac{p(y \mid x)}{p_w(y \mid x)} = \mathrm{KL}(p(y \mid x) \,\|\, p_w(y \mid x)),$$
and for the second term,
$$\sum_{y,z} p(y, z \mid x) \log \frac{p(z \mid x, y)}{p_w(z \mid x, y)} = \sum_{y,z} p(y \mid x)\, p(z \mid x, y) \log \frac{p(z \mid x, y)}{p_w(z \mid x, y)} = \mathbb{E}_{y \mid x}\left[ \sum_z p(z \mid x, y) \log \frac{p(z \mid x, y)}{p_w(z \mid x, y)} \right] = \mathbb{E}_{y \mid x}\big[\mathrm{KL}(p(z \mid x, y) \,\|\, p_w(z \mid x, y))\big].$$
It follows that
$$\mathrm{KL}(p(y, z \mid x) \,\|\, p_w(y, z \mid x)) = \mathrm{KL}(p(y \mid x) \,\|\, p_w(y \mid x)) + \mathbb{E}_{y \mid x}\big[\mathrm{KL}(p(z \mid x, y) \,\|\, p_w(z \mid x, y))\big].$$
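The decomposition above is the chain rule for KL divergence, and it can be checked numerically on small discrete distributions (the distributions below are illustrative values we choose, not from the paper):

```python
# Numeric check (illustrative distributions, not from the paper) of the KL
# chain rule used above: for a fixed input x,
#   KL(p(y,z) || q(y,z)) = KL(p(y) || q(y)) + E_{y~p}[ KL(p(z|y) || q(z|y)) ].
import math

p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # "true" p(y,z|x)
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # model p_w(y,z|x)

def kl(a, b):
    return sum(a[k] * math.log(a[k] / b[k]) for k in a if a[k] > 0)

lhs = kl(p, q)                                    # joint KL

py = {y: p[(y, 0)] + p[(y, 1)] for y in (0, 1)}   # marginals over y
qy = {y: q[(y, 0)] + q[(y, 1)] for y in (0, 1)}
rhs = kl(py, qy)                                  # marginal KL ...
for y in (0, 1):                                  # ... plus expected conditional KL
    pz = {z: p[(y, z)] / py[y] for z in (0, 1)}
    qz = {z: q[(y, z)] / qy[y] for z in (0, 1)}
    rhs += py[y] * kl(pz, qz)
```

The two sides agree to floating-point precision, as the algebra above guarantees.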

E KL DIVERGENCE OF TRUNCATED GAUSSIANS

Given two normal distributions truncated on $[0, +\infty)$, with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, the KL divergence can be computed as (Choudrey, 2002, Appendix A.5)
$$\mathrm{KL}(\mathcal{TN}_1 \,\|\, \mathcal{TN}_2) = \frac{1}{2}\left[\frac{\sigma_1^2}{\sigma_2^2} - 1 - \log\frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2}\right] + \left[\left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)\mu_1 - \frac{2\mu_2}{\sigma_2^2}\right]\frac{\sigma_1}{\sqrt{2\pi}}\cdot\frac{1}{\exp\!\left(\frac{\mu_1^2}{2\sigma_1^2}\right)\left(1 - \mathrm{erf}\!\left(\frac{-\mu_1}{\sqrt{2}\sigma_1}\right)\right)} + \log\frac{1 - \mathrm{erf}\!\left(\frac{-\mu_2}{\sqrt{2}\sigma_2}\right)}{1 - \mathrm{erf}\!\left(\frac{-\mu_1}{\sqrt{2}\sigma_1}\right)},$$
where $\mathrm{erf}(\cdot)$ is the Gauss error function. Letting $\mu_1 = 0$ and $\sigma_1$ tend to zero, we obtain
$$\lim_{\sigma_1 \to 0} \mathrm{KL}(\mathcal{TN}_1(0, \sigma_1) \,\|\, \mathcal{TN}_2(\mu_2, \sigma_2)) = -\log\sigma_2 + \frac{\mu_2^2}{2\sigma_2^2} + \log\left(1 - \mathrm{erf}\!\left(\frac{-\mu_2}{\sqrt{2}\sigma_2}\right)\right).$$

F THE ALGORITHM OF LOGICAL TRAINING

In practice, we set a lower bound (0.01) on the variance $\delta^2$ for numerical stability.

G OPTIMALITY OF ALGORITHM 1

Algorithm 1: Logical Training Procedure
Initialize: $w^0$ randomly; $\tau^0_\wedge$ and $\tau^0_\vee$ uniformly; $\delta^0 = 1$.
for $t = 0, 1, \ldots$ do
    Draw a collection of i.i.d. data samples $\{(x_i, y_i)\}_{i=1}^{N}$.
    $w^{t+1} \leftarrow w^t - \eta_w \cdot \nabla_w L(w, \delta; \tau_\wedge, \tau_\vee)$.
    $\delta^{t+1} \leftarrow \big(\sum_{i=1}^{N} \mu_i^t\big)/N$.
    $\tau^{t+1}_\wedge \leftarrow \tau^t_\wedge + \eta_\wedge \cdot \nabla_{\tau_\wedge} L(w, \delta; \tau_\wedge, \tau_\vee)$.
    $\tau^{t+1}_\vee \leftarrow \tau^t_\vee - \eta_\vee \cdot \nabla_{\tau_\vee} L(w, \delta; \tau_\wedge, \tau_\vee)$.
end for

$L(w, \delta; \tau_\wedge, \tau_\vee)$ is convex w.r.t. $\tau_\wedge$ and $\tau_\vee$ by the convexity of composite functions (Boyd et al., 2004, Sec. 3.2). Hence, $\min_{\tau_\vee} \max_{\tau_\wedge} L(w, \delta; \tau_\wedge, \tau_\vee)$ is in fact a convex-concave optimization, and the PL assumption is satisfied (Yang et al., 2021); thus the GDA algorithm with a suitable step size achieves a global minimax point (i.e., a saddle point) in $O(\varepsilon^{-2})$ iterations (Nedić & Ozdaglar, 2009; Adolphs, 2018). Further properties of $(\tau^*_\wedge, \tau^*_\vee)$ are detailed in Propositions 1 and 2 and Theorem 1.

Next, we confirm that the PL condition holds when the logical constraints are not sufficiently satisfied. Since $L(w, \delta; \tau_\wedge, \tau_\vee)$ is strictly increasing and convex w.r.t. $\mu_i$ on $[0, +\infty)$, we analyze the cost function instead, i.e., $\mu_i = S_\alpha(v)$.

Proposition 4. Given a logical constraint $\alpha$ with cost function $S_\alpha(v)$, assume $\max_{\tau_\wedge} \min_{\tau_\vee} S_\alpha(v) \ge \kappa$ for a constant $\kappa > 0$. Then the PL property
$$\|\nabla_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee)\|^2 \ge \kappa\big[\max_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee) - S_\alpha(v; \tau_\wedge, \tau_\vee)\big]$$
holds for any $\tau_\wedge$.

Proof. Let $t_{ij} = \max(v_{ij} - c_{ij}, 0)$. We have
$$\|\nabla_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee)\|^2 = \sum_{i \in I}\Big(\sum_{j \in J} \nu_{ij} t_{ij}\Big)^2, \qquad \max_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee) = \max_{i \in I}\Big(\sum_{j \in J} \nu_{ij} t_{ij}\Big).$$
Since $\max_{i \in I} \sum_{j \in J} \nu_{ij} t_{ij} \ge \max_{i \in I} \min_{j \in J} t_{ij} \ge \kappa$, we obtain
$$\|\nabla_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee)\|^2 \ge \Big(\max_{i \in I} \sum_{j \in J} \nu_{ij} t_{ij}\Big)^2 \ge \kappa \max_{i \in I} \sum_{j \in J} \nu_{ij} t_{ij} = \kappa \max_{\tau_\wedge} S_\alpha(v; \tau_\wedge, \tau_\vee).$$
The proof is completed by the non-negativity of $S_\alpha(v; \tau_\wedge, \tau_\vee)$. ∎
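The GDA updates for the dual variables can be illustrated on a toy convex-concave problem (our example, unrelated to the paper's actual loss $L$): gradient descent on the minimization variable and gradient ascent on the maximization variable converge to the saddle point.

```python
# Toy illustration (our example, not the paper's loss) of the gradient
# descent-ascent (GDA) updates used for the dual variables in Algorithm 1:
# minimize over a, maximize over b on the convex-concave objective
#   L(a, b) = a^2 + a*b - b^2,
# whose unique saddle point is (a, b) = (0, 0).

def grads(a, b):
    return 2 * a + b, a - 2 * b   # dL/da, dL/db

a, b = 1.0, 1.0
eta = 0.05
for _ in range(2000):
    ga, gb = grads(a, b)
    a, b = a - eta * ga, b + eta * gb   # descent on a, ascent on b
# (a, b) converges to the saddle point (0, 0)
```

With a step size that is too large the iterates would spiral away, which is why the convergence guarantees above require a suitably chosen step size.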

G.2 CONVERGENCE OF MODEL PARAMETERS

The minimization of $L(w, \delta; \tau_\wedge, \tau_\vee)$ w.r.t. $w$ and $\delta$ can be viewed as a competitive optimization
$$\min_w\ \ell_{\mathrm{training}}(w) + \ell_{\mathrm{logic}}(w, \delta), \qquad \min_\delta\ \ell_{\mathrm{logic}}(w, \delta),$$
where $\ell_{\mathrm{training}}(w)$ and $\ell_{\mathrm{logic}}(w, \delta)$ are the training loss and the logical loss, i.e.,
$$\ell_{\mathrm{training}}(w) = \sum_{i=1}^{N} \mathrm{KL}(p(y_i \mid x_i) \,\|\, p_w(y_i \mid x_i)), \qquad \ell_{\mathrm{logic}}(w, \delta) = \sum_{i=1}^{N} \mathbb{E}_{y_i \mid x_i}\big[\mathrm{KL}(p(z_i \mid x_i, y_i) \,\|\, p_w(z_i \mid x_i, y_i))\big].$$
A direct method to solve this problem is gradient descent on $(w, \delta)$ jointly; however, this is highly inefficient since a smaller step size is needed to ensure convergence (Lu et al., 2019). An alternative is alternating gradient descent on $w$ and $\delta$, but it may converge to a limit cycle or a saddle point (Powell, 1973). Algorithm 1 can indeed ensure that $w^*$ is an approximately stationary point (i.e., the norm of its gradient is small). Following the proof of Davis & Drusvyatskiy (2018, Theorem 2.1), we present the convergence guarantee as follows. We start with the weak convexity of the function $\min_y f(\cdot, y)$.

Proposition 5. Suppose $f : X \times Y \to \mathbb{R}$ is $\ell_f$-smooth, and $\psi(\cdot) = \arg\min_y f(\cdot, y)$ is continuously differentiable with Lipschitz constant $L_\psi$. Then $\upsilon(\cdot) = \min_y f(\cdot, y)$ is $\varrho$-weakly convex, where $\varrho = \ell_f(1 + L_\psi)$.

Proof. Let $y' = \psi(x')$ and $y = \psi(x)$. Since $f$ is $\ell_f$-smooth, we obtain
$$\upsilon(x') = f(x', y') \ge f(x, y) + \langle \nabla_x f(x, y), x' - x \rangle + \langle \nabla_y f(x, y), y' - y \rangle - \frac{\ell_f}{2}\big(\|x' - x\|^2 + \|y' - y\|^2\big) \ge \upsilon(x) + \langle \nabla_x \upsilon(x), x' - x \rangle - \frac{\ell_f}{2}\|x - x'\|^2 - \frac{\ell_f L_\psi}{2}\|x - x'\|^2,$$
which finishes the proof. ∎

The $\varrho$-weak convexity of $\upsilon(\cdot)$ means that $\upsilon(x) + (\varrho/2)\|x\|^2$ is a convex function of $x$. Next, we introduce the Moreau envelope, which plays an important role in our proof.

Proposition 6. For a given closed convex function $f$ on a Hilbert space $H$, the Moreau envelope of $f$ is defined by
$$e_{tf}(x) = \min_{y \in H}\ f(y) + \frac{1}{2t}\|y - x\|^2,$$
where $\|\cdot\|$ is the usual Euclidean norm. The minimizer of the Moreau envelope $e_{tf}(x)$ is called the proximal mapping of $f$ at $x$, denoted
$$\mathrm{Prox}_{tf}(x) = \arg\min_{y \in H}\ f(y) + \frac{1}{2t}\|y - x\|^2.$$
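For intuition, the Moreau envelope and proximal mapping admit closed forms for simple functions. The following sketch (an illustration we add, not part of the proof) takes $f(y) = |y|$, whose proximal mapping is soft-thresholding and whose envelope is the Huber function, and checks the envelope against brute-force minimization.

```python
# Illustration (added by us, not part of the proof): Moreau envelope and
# proximal mapping of f(y) = |y|. The prox is soft-thresholding and the
# envelope e_{t|.|} is the Huber function; we verify the envelope against
# brute-force minimization over a fine grid.

def prox_abs(x, t):
    """Prox_{t|.|}(x) = argmin_y |y| + (1/(2t))*(y - x)^2 (soft-threshold)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def envelope_abs(x, t):
    """e_{t|.|}(x) = min_y |y| + (1/(2t))*(y - x)^2 (Huber function)."""
    return abs(x) - t / 2 if abs(x) > t else x * x / (2 * t)

t = 0.5
for x in (-2.0, -0.3, 0.0, 0.4, 1.7):
    brute = min(abs(0.0001 * i) + (0.0001 * i - x) ** 2 / (2 * t)
                for i in range(-40000, 40001))
    assert abs(brute - envelope_abs(x, t)) < 1e-6
    # gradient identity: grad e_{tf}(x) = (x - Prox_{tf}(x)) / t, bounded by 1
    assert abs((x - prox_abs(x, t)) / t) <= 1.0 + 1e-12
```

The envelope is smooth even though $|y|$ is not, which is exactly why the convergence analysis measures progress through $\|\nabla e_{\lambda\upsilon}(x)\|$.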
It is proved in Rockafellar (2015, Theorem 31.5) that the envelope function $e_{tf}(\cdot)$ is convex and continuously differentiable, with
$$\nabla e_{tf}(x) = \frac{1}{t}\big(x - \mathrm{Prox}_{tf}(x)\big). \tag{13}$$
The following theorem bridges the Moreau envelope and the subdifferential of a weakly convex function (Jin et al., 2020, Lemma 30). In particular, a small gradient $\|\nabla e_\upsilon(x)\|$ implies that $x$ is close to a nearly stationary point $\hat{x}$ of $\upsilon(\cdot)$.

Theorem 4. Assume the function $\upsilon$ is $\varrho$-weakly convex. For any $\lambda < 1/\varrho$, let $\hat{x} = \mathrm{Prox}_{\lambda\upsilon}(x)$. If $\|\nabla e_{\lambda\upsilon}(x)\| \le \epsilon$, then $\|\hat{x} - x\| \le \lambda\epsilon$ and $\min_{g \in \partial\upsilon(\hat{x})} \|g\| \le \epsilon$.

This theorem is an immediate consequence of the fact that the stationary points of $\upsilon(\cdot)$ coincide with those of the smooth function $e_{\lambda\upsilon}(\cdot)$; see Drusvyatskiy & Paquette (2019, Lemma 4.3) for details. Now we are ready to prove the convergence of Algorithm 1.

Theorem 5. Suppose $f$ is $\ell_f$-smooth and $L_f$-Lipschitz, and define $\upsilon(\cdot) = \min_y f(\cdot, y)$ and $\psi(\cdot) = \arg\min_y f(\cdot, y)$. The iterative updates for the minimization $\min_{x,y} f(x, y)$ are
$$x^{t+1} = x^t - \eta\nabla_x f(x^t, y^t), \qquad y^{t+1} = \varphi(x^{t+1}), \quad \text{s.t. } \|y^{t+1} - \psi(x^{t+1})\| \le \epsilon.$$
Assume that $\psi(\cdot)$ and $\varphi(\cdot)$ are Lipschitz with moduli $L_\psi$ and $L_\varphi$, and let $\varrho = \ell_f(1 + L_\psi)$ and $\rho = \ell_f(1 + 2L_\varphi^2)$. Then, with step size $\eta = \gamma/\sqrt{T+1}$, the output $\bar{x}$ of $T$ iterations satisfies
$$\mathbb{E}\big[\|\nabla e_{\upsilon/2\rho}(\bar{x})\|^2\big] \le \frac{4\varrho^2}{\rho(\rho+\varrho)}\left(\frac{e_{\upsilon/2\rho}(x^0) - \min_x \upsilon(x) + \rho\gamma^2 L_f^2}{\gamma\sqrt{T+1}} + \rho\ell_f\epsilon\right)$$
when $\rho \ge \varrho$, and
$$\mathbb{E}\big[\|\nabla e_{\upsilon/2\varrho}(\bar{x})\|^2\big] \le \frac{4\varrho^2}{\varrho(3\varrho-\rho)}\left(\frac{e_{\upsilon/2\varrho}(x^0) - \min_x \upsilon(x) + \varrho\gamma^2 L_f^2}{\gamma\sqrt{T+1}} + \varrho\ell_f\epsilon\right)$$
otherwise.

Proof. We have
$$\upsilon(x') = f(x', y') \ge f(x', y^t) - \langle\nabla_y f(x', y'), y^t - y'\rangle - \frac{\ell_f}{2}\|y^t - y'\|^2 = f(x', y^t) - \frac{\ell_f}{2}\|y^t - y'\|^2 \ge f(x^t, y^t) + \langle\nabla_x f(x^t, y^t), x' - x^t\rangle - \frac{\ell_f}{2}\|x' - x^t\|^2 - \frac{\ell_f}{2}\|y^t - y'\|^2 \ge \upsilon(x^t) + \langle\nabla_x f(x^t, y^t), x' - x^t\rangle - \frac{\ell_f}{2}\|x' - x^t\|^2 - \frac{\ell_f}{2}\|y^t - y'\|^2.$$
Moreover,
$$\|y^t - y'\|^2 = \|y^t - \psi(x')\|^2 \le \big(\|y^t - \varphi(x')\| + \|\varphi(x') - \psi(x')\|\big)^2 \le 2\big(\|y^t - \varphi(x')\|^2 + \epsilon\big) = 2\big(\|\varphi(x^t) - \varphi(x')\|^2 + \epsilon\big) \le 2\big(L_\varphi^2\|x^t - x'\|^2 + \epsilon\big),$$
where the second inequality is derived using the Cauchy-Schwarz inequality. Hence,
$$\upsilon(x') \ge \upsilon(x^t) - \ell_f\epsilon + \langle\nabla_x f(x^t, y^t), x' - x^t\rangle - \frac{\ell_f + 2\ell_f L_\varphi^2}{2}\|x' - x^t\|^2. \tag{14}$$
Let $\rho = \ell_f + 2\ell_f L_\varphi^2$. We next discuss the two cases $\rho \ge \varrho$ and $\rho \le \varrho$ separately.

If $\rho \ge \varrho$, let $\hat{x}^t = \mathrm{Prox}_{\upsilon/2\rho}(x^t) = \arg\min_x\ \upsilon(x) + \rho\|x - x^t\|^2$.
We can obtain that
$$e_{\upsilon/2\rho}(x^{t+1}) = \min_x\ \upsilon(x) + \rho\|x - x^{t+1}\|^2 \le \upsilon(\hat{x}^t) + \rho\|x^{t+1} - \hat{x}^t\|^2 = \upsilon(\hat{x}^t) + \rho\|x^t - \eta\nabla_x f(x^t, y^t) - \hat{x}^t\|^2 = \upsilon(\hat{x}^t) + \rho\|x^t - \hat{x}^t\|^2 + 2\eta\rho\langle\nabla_x f(x^t, y^t), \hat{x}^t - x^t\rangle + \eta^2\rho\|\nabla_x f(x^t, y^t)\|^2 = e_{\upsilon/2\rho}(x^t) + 2\eta\rho\langle\nabla_x f(x^t, y^t), \hat{x}^t - x^t\rangle + \eta^2\rho\|\nabla_x f(x^t, y^t)\|^2 \le e_{\upsilon/2\rho}(x^t) + 2\eta\rho\Big[\upsilon(\hat{x}^t) - \upsilon(x^t) + \ell_f\epsilon + \frac{\rho}{2}\|\hat{x}^t - x^t\|^2\Big] + \eta^2\rho L_f^2,$$
where the last inequality follows from Eq. (14). Taking a telescopic sum over $t$, we have
$$e_{\upsilon/2\rho}(x^T) \le e_{\upsilon/2\rho}(x^0) + 2\eta\rho\sum_{t=0}^{T}\Big[\upsilon(\hat{x}^t) - \upsilon(x^t) + \ell_f\epsilon + \frac{\rho}{2}\|\hat{x}^t - x^t\|^2\Big] + \eta^2\rho L_f^2 T.$$
Rearranging, we obtain
$$\frac{1}{T+1}\sum_{t=0}^{T}\Big[\upsilon(x^t) - \upsilon(\hat{x}^t) - \frac{\rho}{2}\|\hat{x}^t - x^t\|^2\Big] \le \frac{e_{\upsilon/2\rho}(x^0) - \min_x\upsilon(x)}{2\eta\rho T} + \ell_f\epsilon + \frac{\eta L_f^2}{2}. \tag{15}$$
Since $\upsilon(x)$ is $\varrho$-weakly convex, $\upsilon(x) + \rho\|x - x^t\|^2$ is strongly convex when $\rho \ge \varrho$. Therefore,
$$\upsilon(x^t) - \upsilon(\hat{x}^t) - \frac{\rho}{2}\|\hat{x}^t - x^t\|^2 = \upsilon(x^t) + \rho\|x^t - x^t\|^2 - \Big[\upsilon(\hat{x}^t) + \rho\|\hat{x}^t - x^t\|^2\Big] + \frac{\rho}{2}\|\hat{x}^t - x^t\|^2 = \upsilon(x^t) + \rho\|x^t - x^t\|^2 - \min_x\Big[\upsilon(x) + \rho\|x - x^t\|^2\Big] + \frac{\rho}{2}\|\hat{x}^t - x^t\|^2 \ge \frac{\rho+\varrho}{2}\|\hat{x}^t - x^t\|^2 = \frac{\rho+\varrho}{8\varrho^2}\|\nabla e_{\upsilon/2\rho}(x^t)\|^2,$$
where the last equality follows from Eq. (13). The result is obtained by combining this with Eq. (15).

If $\rho \le \varrho$, let $\hat{x}^t = \mathrm{Prox}_{\upsilon/2\varrho}(x^t) = \arg\min_x\ \upsilon(x) + \varrho\|x - x^t\|^2$. We can obtain that
$$e_{\upsilon/2\varrho}(x^{t+1}) = \min_x\ \upsilon(x) + \varrho\|x - x^{t+1}\|^2 \le \upsilon(\hat{x}^t) + \varrho\|x^{t+1} - \hat{x}^t\|^2 = \upsilon(\hat{x}^t) + \varrho\|x^t - \eta\nabla_x f(x^t, y^t) - \hat{x}^t\|^2 = e_{\upsilon/2\varrho}(x^t) + 2\eta\varrho\langle\nabla_x f(x^t, y^t), \hat{x}^t - x^t\rangle + \eta^2\varrho\|\nabla_x f(x^t, y^t)\|^2 \le e_{\upsilon/2\varrho}(x^t) + 2\eta\varrho\Big[\upsilon(\hat{x}^t) - \upsilon(x^t) + \ell_f\epsilon + \frac{\rho}{2}\|\hat{x}^t - x^t\|^2\Big] + \eta^2\varrho L_f^2.$$
Taking a telescopic sum over $t$ and rearranging as before yields
$$\frac{1}{T+1}\sum_{t=0}^{T}\Big[\upsilon(x^t) - \upsilon(\hat{x}^t) - \frac{\rho}{2}\|\hat{x}^t - x^t\|^2\Big] \le \frac{e_{\upsilon/2\varrho}(x^0) - \min_x\upsilon(x)}{2\eta\varrho T} + \ell_f\epsilon + \frac{\eta L_f^2}{2}, \tag{16}$$
and the proof is finished by bounding the left-hand side of Eq. (16) as in the previous case. ∎

Theorem 3 can be derived by setting
$$\Delta_0 = e_{\upsilon/2\rho}(x^0) - \min_x\upsilon(x),\ \kappa = \frac{4\varrho^2}{\rho(\rho+\varrho)}\ \text{ if } \rho \ge \varrho; \qquad \Delta_0 = e_{\upsilon/2\varrho}(x^0) - \min_x\upsilon(x),\ \kappa = \frac{4\varrho^2}{\varrho(3\varrho-\rho)}\ \text{ otherwise.}$$
Finally, we confirm the assumptions in our theorems. To ensure the Lipschitz continuity of $L(w, \delta)$ in our problem, we practically impose an interval $[\delta_{\mathrm{lower}}, \delta_{\mathrm{upper}}]$ on $\delta$. Furthermore, although we can obtain the closed-form expression of $\varphi(w)$, i.e., $\varphi(w) = \frac{1}{N}\sum_{i=1}^{N}\mu_i$, it is difficult to compute the closed-form expression of $\psi(w) = \arg\min_\delta L(w, \delta)$ in our work.
However, for any $\delta = \psi(w)$, it holds that
$$\delta = \frac{1}{N}\sum_{i=1}^{N}\left(\mu_i^2 - \frac{\mu_i \cdot \delta \cdot \phi(-\mu_i/\delta)}{1 - \Phi(-\mu_i/\delta)}\right),$$
where $\cdot$ denotes the Hadamard product. Since $\mu_i \cdot \phi(-\mu_i/\delta)$ is bounded, $\psi(w) - \delta$ is also bounded, and thus the assumption in Theorem 5 is satisfied.

K RESULTS OF TRAINING EPOCH RUNTIME

Regarding the convergence rate, Theorem 3 states that our algorithm has the same convergence order as the baseline (stochastic (sub)gradient descent) algorithm. We further report the runtime of each epoch of the CIFAR100 experiment in Table 6. We observe that, in each epoch, our method introduces little extra computational cost.

L RESULTS WITH STANDARD DEVIATION ON CIFAR100

For task 4, we report detailed results with standard deviations as follows.



Note that z can be considered a random variable even when the data point (x, y) is given, because there usually exists randomness when training the model (e.g., dropout, batch normalization). The set of stationary points contains the local Nash equilibria (Jin et al., 2020); in other words, all local Nash equilibria are stationary points, but not vice versa.



Figure2: The accuracy results (%) of image classification on the CIFAR100 dataset. The proposed approach outperforms the competitors in all the three cases for both class and superclass classification.




Results (%) of the handwritten digit recognition task. The proposed approach learns how the logical constraint is satisfied (i.e., Q-Sat.) while the existing methods fail. (Table columns: Acc., Sat., ¬P-Sat., Q-Sat.)

The constraint satisfaction results (%) of image classification on the CIFAR100 dataset. The proposed approach outperforms the competitors in all three cases.

the interpreter, and it is also unclear whether such an encoding bears important properties such as monotonicity. To address this issue, Fischer et al. (2019) abandon the interpreter and use addition/multiplication to encode logical conjunction/disjunction; Yang et al. (2022) apply straight-through estimators to encode CNF constraints into a loss function. However, both still lack the monotonicity property. Xu et al. (2018) propose a semantic loss to impose Boolean constraints on the output layer, which ensures the monotonicity property. However, its encoding rule does not discriminate between different satisfying assignments, causing the shortcut satisfaction problem. Hoernle et al. (2021) aim at training a neural network that fully satisfies the logical constraints, and hence directly restrict the model's output to the constraint. Although they introduce a variable to choose which constraint should be satisfied, the variable is categorical and thus limited to mutually exclusive disjunctions.



Results from CIFAR10 to STL10.

Runtime of each training epoch.

Accuracy of class classification on CIFAR100 dataset.

The constraint satisfaction results (%) of image classification on the CIFAR100 dataset.

ACKNOWLEDGMENT

We are thankful to the anonymous reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (Grants #62025202, #62172199). T. Chen is also partially supported by Birkbeck BEI School Project (EFFECT) and an overseas grant of the State Key Laboratory of Novel Software Technology under Grant #KFKT2022A03. Yuan Yao (y.yao@nju.edu.cn) and Xiaoxing Ma (xxm@nju.edu.cn) are the corresponding authors.

H ADDITIONAL DETAILS FOR EXPERIMENTS

We implemented our approach with the PyTorch framework. For PD and DL2, we use the code provided by the respective authors. The experiments were conducted on a GPU server with two Intel Xeon Gold 5118 CPUs @ 2.30 GHz, 400 GB RAM, and 9 GeForce RTX 2080 Ti GPUs, running Ubuntu 16.04 with GNU/Linux kernel 4.4.0.

Handwritten Digit Recognition. For this experiment, we used the LeNet-5 architecture, and set the batch size to 128 and the number of epochs to 60. For the baseline, DL2, and our approach, we optimized the loss using the Adam optimizer with learning rate 1e-3. For the PD method, we directly follow the hyper-parameters provided in its GitHub repository. For the SL method, we set the weight of the constraint loss to 0.5 and optimized the loss using the Adam optimizer with learning rate 5e-4 (as used in its GitHub repository).

Handwritten Formula Recognition. For this experiment, we used the LeNet-5 architecture, set the batch size to 128, and fixed the number of epochs to 600. For the baseline, DL2, and our approach, we optimized the loss using the Adam optimizer with learning rate 1e-3 and weight decay 1e-5. For the PD method, we directly follow the hyper-parameters provided in its GitHub repository.

Shortest Path Distance Prediction. We used a multilayer perceptron with |V| × |V| input neurons, three hidden layers with 1,000 neurons each, and an output layer of |V| neurons. We used the first node (with the smallest index) as the source node. The input is the adjacency matrix of the graph, and the output is the distance from the source node to all the other nodes. We applied ReLU activations to each hidden layer and to the output layer (to make the predictions non-negative). The batch size and the number of epochs were set to 128 and 300, respectively. The network was optimized using the Adam optimizer with learning rate 1e-4 and weight decay 5e-4.

CIFAR100 Image Classification. In this task, we set the batch size to 128 and the number of epochs to 3,600.
We trained the baseline model with the SGD algorithm with learning rate 0.1, decayed by a factor of 0.1 every 300 epochs. For the other methods, we used the Adam optimizer with learning rate 5e-4. On the HWF dataset, we also investigate whether the logical knowledge can boost training efficiency.

I TRAINING EFFICIENCY BY INJECTING LOGICAL KNOWLEDGE

Specifically, we use only the labeled data (i.e., 2% and 5% of the training data) to train the model with and without the logical constraints.

J RESULTS OF TRANSFER LEARNING EXPERIMENT

To show the robustness of the proposed approach under transfer learning, we use the STL10 dataset to evaluate a ResNet18 model trained on the CIFAR10 dataset. We set the batch size to 100 and the number of epochs to 300. For both the baseline and our approach, we remove the data augmentation operators during training, and optimize the loss with the Adam optimizer with learning rate 1e-3. The model is trained using 3,000 labeled images and 3,000 unlabeled images (the latter used only by our approach). We design a logical constraint similar to that of the CIFAR100 experiment: we define two superclasses for the CIFAR10 dataset, i.e., machines (denoted by m) and animals (denoted by a), and formulate the constraint as (p_m(x) = 0.0% ∨ p_m(x) = 100.0%) ∧ (p_a(x) = 0.0% ∨ p_a(x) = 100.0%). The results are shown in Table 5. We observe that the logical rule is more stable than the model accuracy. Moreover, although such a weak logical constraint only slightly improves the model (class and superclass) accuracy on CIFAR10, the improvement is preserved when domain shift occurs.

