A MULTI-GRAINED SELF-INTERPRETABLE SYMBOLIC-NEURAL MODEL FOR SINGLE/MULTI-LABELED TEXT CLASSIFICATION

Abstract

Deep neural networks based on layer-stacking architectures have historically suffered from poor inherent interpretability. Symbolic probabilistic models, by contrast, are clearly interpretable, but how to combine them with neural networks to improve performance remains underexplored. In this paper, we marry these two systems for text classification via a structured language model. We propose a Symbolic-Neural model that learns to explicitly predict class labels of text spans from a constituency tree without requiring any access to span-level gold labels. Because the structured language model learns to predict constituency trees in a self-supervised manner, only raw texts and sentence-level labels are required as training data, which makes our approach essentially a general constituent-level self-interpretable classification model. Our experiments demonstrate that the approach achieves good prediction accuracy in downstream tasks, while the predicted span labels are consistent with human rationales to a certain degree.

1. INTRODUCTION

Lack of interpretability is an intrinsic problem of layer-stacking deep neural networks for text classification. Many methods have been proposed to provide post-hoc explanations for neural networks (Lipton, 2018; Lundberg & Lee, 2017; Sundararajan et al., 2017). However, these methods have multiple drawbacks. First, they provide only word-level attribution, with no higher-level attribution over phrases or clauses. Take sentiment analysis as an example: in addition to recognizing the sentiment of sentences, an ideal interpretable model should identify sentiment and polarity reversal at the level of words, phrases, and clauses. Second, as argued by Rudin (2019), models should be inherently interpretable rather than explained by a post-hoc model. A widely accepted property of natural languages is that "the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined" (Partee, 1995). Compared with the sequential outputs of layer-stacked model architectures, syntactic tree structures naturally capture features at various levels because each node in a tree represents a constituent span. This characteristic motivates us to ask whether the representations of these internal nodes could be leveraged to design an inherently constituent-level interpretable model. One challenge faced by this idea is that traditional syntactic parsers require supervised training and degrade on out-of-domain data. Fortunately, with the development of structured language models (Tu et al., 2013; Maillard et al., 2017; Choi et al., 2018; Kim et al., 2019), we can now learn hierarchical syntactic structures from any raw text without supervision. In this paper, we propose a general self-interpretable text classification model that learns to predict span-level labels in an unsupervised manner, as shown in Figure 1.
Specifically, we propose a novel label extraction framework based on a simple inductive bias for inference. During training, we maximize the total probability of all potential trees whose extracted labels are consistent with the gold label set, via dynamic programming with linear complexity. By using a structured language model as the backbone, we can leverage the internal representations of constituent spans as symbolic interfaces, on which we build the transition functions of the dynamic programming algorithm. The main contribution of this work is the Symbolic-Neural model, a simple but general model architecture for text classification with three advantages:
1. It has both competitive prediction accuracy and self-interpretability, with rationales explicitly reflected in the label probabilities of each constituent.
2. It can learn to predict span-level labels without requiring any access to span-level gold labels.
3. It handles both single-label and multi-label text classification in a unified way, instead of reducing the latter to binary classification problems (Read et al., 2011) as in conventional methods.
To the best of our knowledge, we are the first to propose a general constituent-level self-interpretable classification model with good downstream task performance. Our experiments show that the span-level attribution is consistent with human rationales to a certain extent. We argue that these characteristics could be valuable in various application scenarios such as data mining, NLU systems, and prediction explanation, some of which we discuss in our experiments.

2.1. ESSENTIAL PROPERTIES OF STRUCTURED LANGUAGE MODELS

Structured language models combine the powerful representations of neural networks with syntactic structures. Though many attempts have been made at structured language models (Kim et al., 2019; Drozdov et al., 2019; Shen et al., 2021), three prerequisites need to be met before a model can serve as the backbone of our method. Firstly, it should be able to learn reasonable syntactic structure in an unsupervised manner. Secondly, it should compute an intermediate representation for each constituency node. Thirdly, it should have a pretraining mechanism to improve representation quality. Since Fast-R2D2 (Hu et al., 2021; 2022) satisfies all the above conditions and also has good inference speed, we choose Fast-R2D2 as our backbone.

2.2. FAST-R2D2

Overall, Fast-R2D2 is a structured language model that takes raw text as input and outputs the corresponding binary parse tree along with node representations, as shown in Figure 3(a). The representation e_{i,j}, representing the text span from the i-th to the j-th word, is computed recursively from its child representations via a shared composition function, i.e., e_{i,j} = f(e_{i,k}, e_{k+1,j}), where k is the split point given by the parser and f(·) is an n-layer Transformer encoder. When i = j, e_{i,j} is initialized as the embedding of the corresponding input token. Please note that the parser is trained in a self-supervised manner, so no human-annotated parse trees are required.
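As a toy illustration of this bottom-up composition, the sketch below recurses over a binary parse tree; the `Node` class and the elementwise-mean `mean_f` are our stand-ins, not Fast-R2D2's actual API or Transformer composition function.

```python
class Node:
    """Minimal binary parse-tree node (a hypothetical stand-in)."""
    def __init__(self, i, j, left=None, right=None):
        self.i, self.j = i, j            # span [i, j], inclusive
        self.left, self.right = left, right
        self.emb = None                  # e_{i,j}, filled in below

def compose_bottom_up(node, token_embs, f):
    """Compute e_{i,j} recursively: e_{i,i} is the token embedding,
    e_{i,j} = f(e_{i,k}, e_{k+1,j}) for internal nodes."""
    if node.left is None:                # leaf: i == j
        node.emb = token_embs[node.i]
    else:
        compose_bottom_up(node.left, token_embs, f)
        compose_bottom_up(node.right, token_embs, f)
        node.emb = f(node.left.emb, node.right.emb)
    return node.emb

# toy composition function standing in for the n-layer Transformer f(.)
mean_f = lambda a, b: [(x + y) / 2.0 for x, y in zip(a, b)]
```

On the tree ((w0 w1) w2) with one-dimensional "embeddings", the root representation is the mean of the left composition and the last token.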

3. SYMBOLIC-NEURAL MODEL

3.1. MODEL

There are two basic components in the Symbolic-Neural model: (1) a structured LM backbone that parses a sentence into a binary tree with node representations, and (2) an MLP that estimates the label distribution from a node representation. For structured LMs that follow a bottom-up hierarchical encoding process (such as our default backbone Fast-R2D2), context outside a span is invisible to the span, which may leave low-level short spans without enough information to predict correct labels. We therefore introduce an optional module that lets information flow through the parse tree from top to bottom. The overall idea is a top-down process that fuses information from both inside and outside each span. For a given span (i, j), we denote the top-down representation as e'_{i,j} and use a Transformer as the top-down encoder f'. The top-down encoding starts from the root and recurses over child nodes. For the root node, we have [·, e'_{1,n}] = f'([e_root, e_{1,n}]), where e_root is the embedding of the special token [ROOT] and n is the sentence length. Once the top-down representation e'_{i,j} is ready, we compute the child representations recursively via [·, e'_{i,k}, e'_{k+1,j}] = f'([e'_{i,j}, e_{i,k}, e_{k+1,j}]), as illustrated in Figure 2. We denote the parameters of the whole model by Ψ, those of the structured LM by Φ, and those of the MLP layer and the top-down encoder by Θ; thus Ψ = {Φ, Θ}.
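A minimal sketch of this optional top-down pass, assuming dict-based nodes with scalar "embeddings"; `toy_f_td` is our stand-in for the top-down Transformer f' (it returns its inputs with every non-first element averaged with the first).

```python
def leaf(emb):
    return {"emb": emb, "left": None, "right": None, "td": None}

def internal(emb, left, right):
    return {"emb": emb, "left": left, "right": right, "td": None}

def top_down(root, f_td, e_root):
    """Fill td (e'_{i,j}) for every node. At the root:
    [., e'_{1,n}] = f_td([e_root, e_{1,n}]); below, the children get
    [., e'_{i,k}, e'_{k+1,j}] = f_td([e'_{i,j}, e_{i,k}, e_{k+1,j}])."""
    root["td"] = f_td([e_root, root["emb"]])[1]
    stack = [root]
    while stack:
        node = stack.pop()
        if node["left"] is None:
            continue
        out = f_td([node["td"], node["left"]["emb"], node["right"]["emb"]])
        node["left"]["td"], node["right"]["td"] = out[1], out[2]
        stack += [node["left"], node["right"]]

# toy f': average each child input with the first (parent/root) input
toy_f_td = lambda xs: [xs[0]] + [(x + xs[0]) / 2.0 for x in xs[1:]]
```

After the pass, every node's `td` mixes its own bottom-up representation with its parent's top-down one, so short spans see outside context.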

3.2. LABEL EXTRACTION FRAMEWORK FOR INFERENCE

During inference, we first use Fast-R2D2 to produce a parsing tree, then predict the label of each node in the tree and output a final label set via the yield function introduced below.

Inductive bias. From observing cases in single- and multi-label classification tasks, we propose an inductive bias: a constituent in a text corresponds to at most one label. As constituents can be seen as nodes in a binary parsing tree, we can associate nodes with labels; texts with multiple labels are handled by assigning labels to non-overlapping child nodes. Note that this inductive bias does not cover the special case in which a minimal semantic constituent is associated with multiple labels, e.g., the movie "Titanic" could be labeled with both 'disaster' and 'love'. However, we argue that such cases are rare, as our inductive bias works well on most single- and multi-label tasks, as demonstrated in our experiments.

Label tree. A label tree is transferred from a parsing tree by associating each node with a label; an example is illustrated in Figure 3(b). During inference, we predict a probability distribution over labels for each node and pick the label with the highest probability. To estimate the label distribution, we use P_Ψ(·|n_{i,j}) = softmax(MLP(e_{i,j})). Note that if the top-down encoder is enabled, we replace e_{i,j} with e'_{i,j}.

Yield function. We design a yield function that traverses a label tree in a top-down manner and extracts labels. For brevity, we write Y for the yield function.

Algorithm 1 Definition of the Yield function
1: function YIELD(t̃)
2:   S ← {}
3:   q ← [t̃.root]                ▷ the list of nodes to visit
4:   while len(q) > 0 do
5:     n_ij ← q.pop(0)
6:     if n_ij.label = φ_NT then
7:       if not n_ij.is_leaf then
8:         q.append(n_ij.left)
9:         q.append(n_ij.right)
10:    else if n_ij.label ≠ φ_T then
11:      S ← S ∪ {n_ij.label}
12:  return S
We divide labels into two categories, terminal and non-terminal, which indicate whether Y should stop or continue, respectively, when it reaches a node. Since some nodes may not be associated with any task-defined label, we introduce empty labels, denoted φ_T and φ_NT for the terminal and non-terminal cases respectively. For simplicity, we do not discuss nesting cases in this paper, so there is only one non-terminal label, φ_NT, and all task-defined labels are terminal labels. Our method can nevertheless be naturally extended to handle nesting cases by allowing non-terminal labels to be associated with task labels. As defined by the pseudo-code in Algorithm 1, Y traverses a label tree from top to bottom starting at the root: when it sees φ_NT, it continues to traverse the node's children; otherwise, when it sees a terminal label, it stops and gathers the task-defined terminal label of the node. Figure 3 illustrates how Y traverses a label tree and gathers task-defined labels.
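The yield function can be realized in a few lines; the `LNode` structure and the label constants below are hypothetical stand-ins for the paper's label-tree data structure.

```python
from collections import deque

PHI_NT = "phi_NT"   # non-terminal empty label: keep traversing
PHI_T = "phi_T"     # terminal empty label: stop, gather nothing

class LNode:
    """A node in a label tree (hypothetical minimal structure)."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
    @property
    def is_leaf(self):
        return self.left is None

def yield_labels(root):
    """Algorithm 1: top-down traversal gathering task-defined labels."""
    out = set()
    q = deque([root])
    while q:
        n = q.popleft()
        if n.label == PHI_NT:
            if not n.is_leaf:
                q.append(n.left)
                q.append(n.right)
        elif n.label != PHI_T:
            out.add(n.label)   # task-defined terminal label: stop here
    return out
```

On a tree shaped like Figure 3(b), nodes below a terminal label are never visited, matching the traversal the text describes.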

3.3. TRAINING OBJECTIVE.

During the training stage, though the structured LM can predict tree structures, the difficulty is how to associate each node with a single label without span-level gold labels. We define our training objective as follows.

Training objective. Given a sentence S of length |S| and its gold label set T = {l_1, ..., l_m}, let t be the best parsing tree given by the unsupervised parser of Fast-R2D2 and t̃ a label tree transferred from t. t̃[C] denotes t̃ satisfying condition C. The training objective is to maximize the probability that the given tree transfers to a label tree whose yielded labels are consistent with the ground-truth labels, formalized as minimizing -log P_Ψ(t̃[Y(t̃)=T] | t).

Before getting into the specifics, we define several key notions. (1) Notation: t_{i,j} denotes the subtree spanning from i to j (both indices inclusive), whose root, left subtree, and right subtree are n_{i,j}, t_{i,k}, and t_{k+1,j} respectively, where k is the split point. (2) Symbolic interface: P_Ψ(l|n_{i,j}) is the probability of a single node n_{i,j} being associated with the specified label l. Thus, the probability of t transferring to a specific label tree t̃ is the product over all nodes of the probabilities of their associated labels in t̃.

Figure 4: To ensure that the yield result of t_{i,j} contains label l, node n_{i,j} needs to be associated with either φ_NT or l, with probabilities P_Ψ(φ_NT|n_{i,j}) and P_Ψ(l|n_{i,j}) respectively. If associated with l, the condition is satisfied. If associated with φ_NT, at least one of its children's yield results must contain l. We use \l to denote that a yield result does not contain label l. In conclusion, Y^l_{i,j} can be estimated recursively by Equation 1.

Obviously, it is intractable to exhaust all potential t̃ to estimate P_Ψ(t̃[Y(t̃)=T] | t). Our core idea is to leverage symbolic interfaces to estimate P_Ψ(t̃[C] | t) via dynamic programming.
We start with an elementary case: estimating the probability that the yield result of t_{i,j} contains a given label l, i.e., P_Ψ(t̃_{i,j}[l ∈ Y(t̃_{i,j})] | t_{i,j}), which we abbreviate as Y^l_{i,j}. Following the recursive formulation illustrated in Figure 4, we have:

Y^l_{i,j} = P_Ψ(l|n_{i,j}) + P_Ψ(φ_NT|n_{i,j}) · (1 - (1 - Y^l_{i,k}) · (1 - Y^l_{k+1,j}))   if i < j
Y^l_{i,j} = P_Ψ(l|n_{i,j})                                                                   if i = j    (1)

However, if we try to estimate P_Ψ(t̃_{i,j}[Y(t̃_{i,j})=M] | t_{i,j}) for a given label set M in the same way, we inevitably exhaust all potential combinations, as illustrated in Figure 5(a), which leads to exponential complexity. To tackle this, we divide the problem of estimating P_Ψ(t̃[Y(t̃)=T] | t) into estimating Y^l_{i,j} for each label l in T. Let F denote the union of all task labels and {φ_T, φ_NT}, and let O denote F \ T. By assuming that the states of labels are independent of each other, where the state of a label indicates whether the label is contained in the yield result, we have:

P_Ψ(t̃[Y(t̃)=T] | t) = P_Ψ(t̃[T ⊆ Y(t̃)], t̃[O ∩ Y(t̃) = ∅] | t)
                    ≈ P_Ψ(t̃[T ⊆ Y(t̃)] | t) · P_Ψ(t̃[O ∩ Y(t̃) = ∅] | t)
P_Ψ(t̃[T ⊆ Y(t̃)] | t) ≈ ∏_{l∈T} P_Ψ(t̃[l ∈ Y(t̃)] | t) = ∏_{l∈T} Y^l_{1,|S|}
P_Ψ(t̃[O ∩ Y(t̃) = ∅] | t) = 1 - P_Ψ(t̃[O ∩ Y(t̃) ≠ ∅] | t)    (2)

We do not approximate P_Ψ(t̃[O ∩ Y(t̃) ≠ ∅] | t), as it can be computed directly. The formulation above allows multiple non-overlapping spans to be associated with the same label. If instead we impose a mutual-exclusiveness constraint, under which two non-overlapping spans may not be associated with the same task label, as shown in Figure 5(b), the recursion becomes:

Y^l_{i,j} = P_Ψ(l|n_{i,j}) + P_Ψ(φ_NT|n_{i,j}) · (Y^l_{i,k} · (1 - Y^l_{k+1,j}) + Y^l_{k+1,j} · (1 - Y^l_{i,k}))   if i < j
Y^l_{i,j} = P_Ψ(l|n_{i,j})                                                                                        if i = j    (3)

Regarding P_Ψ(t̃[O ∩ Y(t̃) ≠ ∅] | t), any Y(t̃) containing a label l ∈ O satisfies the condition. We denote this probability by Y^O_{i,j} for short.
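The recursions in Equations 1 and 3 can be sketched as follows; dict-based nodes and the lookup `p` are our stand-ins for the model's symbolic interface P_Ψ(l|n_{i,j}).

```python
def leaf(probs):
    return {"probs": probs, "left": None, "right": None}

def tree(probs, left, right):
    return {"probs": probs, "left": left, "right": right}

def p(node, label):
    """Stand-in for P(label | node); in the paper this is softmax(MLP(e))."""
    return node["probs"].get(label, 0.0)

def y_contains(node, label, exclusive=False):
    """Probability that the subtree's yield contains `label`:
    Equation 1 (exclusive=False) or Equation 3 (exclusive=True)."""
    if node["left"] is None:                        # i == j
        return p(node, label)
    yl = y_contains(node["left"], label, exclusive)
    yr = y_contains(node["right"], label, exclusive)
    if exclusive:
        child_term = yl * (1 - yr) + yr * (1 - yl)  # exactly one child
    else:
        child_term = 1 - (1 - yl) * (1 - yr)        # at least one child
    return p(node, label) + p(node, "phi_NT") * child_term
```

With child probabilities 0.9 and 0.5 under a purely non-terminal root, Equation 1 gives 0.95 (either child suffices) while Equation 3 gives 0.5 (exactly one child may carry the label).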
Similar to Equation 1, we have:

Y^O_{i,j} = Σ_{l∈O} P_Ψ(l|n_{i,j}) + P_Ψ(φ_NT|n_{i,j}) · (1 - (1 - Y^O_{i,k}) · (1 - Y^O_{k+1,j}))   if i < j
Y^O_{i,j} = Σ_{l∈O} P_Ψ(l|n_{i,j})                                                                   if i = j    (4)

Thus P_Ψ(t̃[Y(t̃)=T] | t) = ∏_{l∈T} Y^l_{1,|S|} · (1 - Y^O_{1,|S|}), and the objective function given a parsing tree is:

L^t_cls(Ψ) = -log P_Ψ(t̃[Y(t̃)=T] | t) = -Σ_{l∈T} log Y^l_{1,|S|} - log(1 - Y^O_{1,|S|})    (5)

Because prior work (Hu et al., 2022) has verified that models achieve better downstream performance and domain adaptivity when trained along with the self-supervised objective L_self(Φ), we design the final loss as:

L = L^t_cls(Ψ) + L_self(Φ)    (6)

4. EXPERIMENTS

4.1. DOWNSTREAM TASKS

In this section, we compare our interpretable Symbolic-Neural model with models based on dense sentence representations to verify that our model works as well as conventional models. All systems are trained on raw texts and sentence-level labels only.

Data set. We report results on the development sets of the following datasets: SST-2, CoLA (Wang et al., 2019), ATIS (Hakkani-Tur et al., 2016), SNIPS (Coucke et al., 2018), and StanfordLU (Eric et al., 2017). Note that SST-2, CoLA, and SNIPS are single-label tasks, while ATIS and StanfordLU are multi-label tasks. StanfordLU has three sub-fields: navigator, scheduler, and weather.

Baselines. To compare our method fairly with other systems, all backbones such as Fast-R2D2 and BERT (Devlin et al., 2019) are pretrained on the same corpus with the same vocabulary and number of epochs. We record the best results over 4 runs with different random seeds and report their mean. Because of GPU resource limits and energy saving, we pretrain all models on Wiki-103 (Merity et al., 2017), which contains 110 million tokens. To compare our model with systems using only whole-sentence representations, we include BERT and Fast-R2D2 using the root representation in our baselines.
To study the reliability of the unsupervised parser, we include systems with a supervised parser (Zhang et al., 2020) that use BERT or a tree encoder as the backbone. For the former, we take the average pooling over the representations of the words in span (i, j) as the representation of the span. For the latter, we use the pretrained R2D2 tree encoder as the backbone. To compare with methods that perform multi-instance learning (MIL) without structural constraints, we extend the multi-instance learning framework proposed by Angelidis & Lapata (2018) to the multi-instance multi-label learning (MIMLL) scenario; details about MIL and MIMLL are in Appendix A.7. We also conduct ablation studies on systems with and without the top-down encoder and the mutual-exclusiveness constraint. For systems using root or [CLS] representations on multi-label tasks, outputs are passed through a sigmoid layer and filtered by a threshold tuned on the training set.

Hyperparameters.

Results and discussion. We make several observations from Table 1. Firstly, we find that our models overall achieve competitive prediction accuracy compared with strong baselines including BERT, especially on multi-label tasks. This result validates the rationality of our label-constituent association inductive bias, and the significant gap over MIMLL demonstrates the benefit of building hierarchical relationships between spans into the model. Secondly, when using sentence representations, the models with the unsupervised parser achieve results similar to those with the supervised parser on most tasks but significantly outperform the latter on CoLA. A possible reason for the supervised systems' poor performance on CoLA is that the dataset contains many sentences with grammar errors, which are not covered by the supervised parser's training set, while the unsupervised parser can adapt to those sentences because L_bilm and L_KL are included in the final loss.
The result reflects the flexibility and adaptability of unsupervised parsers. Thirdly, 'parser+TreeEnc.' in the Symbolic-Neural architecture does not perform as well as 'parser+TreeEnc.' using sentence representations, while the systems using the unsupervised parser show the opposite trend. Considering that the Symbolic-Neural model relies heavily on the representations of inner constituents, we attribute these results to the tree encoder having adapted to the trees produced by the unsupervised parser during the pretraining stage of Fast-R2D2, which leads to self-consistent intermediate representations. This result also verifies that a structured language model learning latent tree structures without supervision is mature enough to serve as the backbone of our method.

4.2. ANALYSIS OF INTERPRETABILITY

Bastings et al. (2022) propose a method that "poisons" a classification dataset with synthetic shortcuts, trains classifiers on the poisoned data, and then tests whether a given interpretability method can pick up on the shortcut.

Setup. Following that work, we define two shortcuts of four consecutive tokens to assess the faithfulness of predicted span labels: #0#1#2#3 and #4#5#6#7 indicate labels 1 and 0 respectively. We select SST-2 and CoLA as the training sets, with an additional 20% of synthetic data. We create a synthetic example by (1) randomly sampling an instance from the source data, (2) inserting the consecutive shortcut tokens at random positions, and (3) setting the label as the shortcut prescribes.

Verification steps. A model trained on the synthetic data achieves 100% accuracy on the synthetic test data, while a model trained on the original dataset achieves around 50% on the synthetic test set.

Sorting tokens. Since our model does not produce a heatmap over input tokens, there is no direct way to obtain the top-K tokens required by the shortcut method, so we propose a simple heuristic tree-based ranking algorithm.
Specifically, for a given label l, we start from the root, denoted n, and compare P(l|n_left) and P(l|n_right), where n_left and n_right are its left and right children. If P(l|n_left) > P(l|n_right), all descendants of the left child are ordered before the descendants of the right child, and vice versa. By applying this rule recursively, we obtain a ranking over all tokens. We additionally report the precision of shortcut span labels in the predicted label trees: a shortcut span label is correct only if the consecutive shortcut tokens are covered by the same span and the predicted label is consistent with the shortcut label.

To evaluate the consistency of the span labels learned by our model with human rationales, we design a constituent-level attribution task. Specifically, we hide the gold span positions in NER and slot-filling datasets to see whether our model can recover the gold spans and labels, so only raw text and sentence-level gold labels are visible to the models. We then train models on these multi-label classification tasks and evaluate the span positions learned without supervision.

Figure 7: A sample of our method on semi-supervised slot filling. The ground truths are Denver, Oakland, afternoon, 5 pm, and nonstop for the corresponding slots. The last three predictions are reasonable even though they differ from the ground truths.

Data set. We report F1 scores on the following datasets: ATIS (Hakkani-Tur et al., 2016), MITRestaurant (Liu et al., 2013a), and MITMovie (Liu et al., 2013b). ATIS is a slot-filling task; the others are NER tasks.

Baselines. We include two baselines with attribution ability on multi-label tasks: integrated gradients (IG) (Sundararajan et al., 2017) and multi-instance learning (Angelidis & Lapata, 2018). We follow the setup in Sec. 4.1 and report the results of the last epoch.
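The tree-based token ranking described above can be sketched as follows; the dict-based nodes and probability lookup are hypothetical stand-ins for the model's label trees.

```python
def tok(token, probs):
    return {"token": token, "probs": probs, "left": None, "right": None}

def tree(probs, left, right):
    return {"probs": probs, "left": left, "right": right}

def p(node, label):
    return node["probs"].get(label, 0.0)

def rank_tokens(node, label):
    """Order all tokens under the child with higher P(label | child)
    before those under the other child, recursively."""
    if node["left"] is None:
        return [node["token"]]
    first, second = node["left"], node["right"]
    if p(second, label) > p(first, label):
        first, second = second, first
    return rank_tokens(first, label) + rank_tokens(second, label)
```

Walking the tree this way turns per-node label probabilities into a total order over tokens, which is all the shortcut evaluation needs.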
For IG, we set the number of interpolation steps to 200, use the same BERT as in the last section as the encoder, filter the attribution of each token by a threshold, and select the filtered positions as outputs. We use zero vectors and [MASK] embeddings as the baselines for IG, as Bastings et al. (2022) find that the latter significantly improves its performance. Since IG scores have no explicit meaning, we allow IG to adjust thresholds according to the test datasets. We report the best results of both baselines and the corresponding thresholds; the full version of the table is in Appendix A.4. For MIMLL, we select the span with the maximum attention score for a specified label; details are in Appendix A.7.

Metrics. We denote the predicted span set by P, the gold span set by G, and the overlap of P and G with matching labels by O. Precision, recall, and F1 are then computed over span lengths, as given at the end of this section.

Results and discussion. From Table 2, one observation is that models with the mutual-exclusiveness constraint achieve better F1 scores, which illustrates that a stronger inductive bias helps models learn constituent-label alignments. Besides, we find the Symbolic-Neural models significantly outperform the MIMLL and IG baselines on the NER datasets but trail IG on the slot-filling task. Studying the outputs of our method, with a sample shown in Figure 7, we find that our model tends to recall long spans, while the ground truths in ATIS tend to be short spans. We also find that on sls-movie-trivial, MIMLL significantly outperforms IG. We therefore hypothesize that the distribution of gold span lengths may affect results. We divide sentences into buckets according to the average gold span length and compute F1 scores for each bucket, as shown in Table 2. Interestingly, the scores of IG decline significantly with increasing span length, while our method performs well on all buckets.
In addition, we argue that the F1 scores on the NER datasets reflect interpretability more objectively, because the boundaries of proper nouns are clear and objective, while slot boundaries are relatively ambiguous with respect to whether to include prepositions, modal verbs, etc. We output the label trees generated by our model trained on Navigator, SST-2, and CoLA to observe whether the model has sufficient interpretability. From Figure 8, we find our method is able to learn potential alignments between intents and texts and show them explicitly. This could be used in multi-intent NLU systems to help attribute slots to their corresponding intents. We also study the differences between the label trees generated by the vanilla Symbolic-Neural model and the Symbolic-Neural model with the top-down encoder; cases can be found in Appendix A.12. We find the vanilla Symbolic-Neural model fails on multi-intent cases, which verifies the necessity of introducing the top-down encoder. For SST-2, as there are no neutral samples, we randomly sample sentences from Wiki-103 as neutral texts and force all their nodes to be φ_NT via a mean squared error loss. Figure 9(a) shows the sentiment polarity of each constituent and the polarity reversal caused by "never". This characteristic could be used for text mining by gathering the minimal spans of a specified label. We also study the label trees generated on CoLA, a linguistic acceptability dataset. We transfer the task to grammar error detection by converting the label "1" to φ, as "1" means no error is found in a sentence. Figure 9(b) shows the model is able to detect incomplete constituents, which may help in applications such as grammar error localization. More cases can be found in the Appendix.

prec = Σ_{o∈O} (o.j - o.i + 1) / Σ_{p∈P} (p.j - p.i + 1),  recall = Σ_{o∈O} (o.j - o.i + 1) / Σ_{g∈G} (g.j - g.i + 1),  F1 = 2 · prec · recall / (prec + recall)
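A sketch of this length-weighted span metric; how the overlap set O is counted (shared tokens of same-label spans) is our reading of the definition, not code from the paper.

```python
def span_f1(pred, gold):
    """Length-weighted span precision/recall/F1.
    Spans are (i, j, label) triples with inclusive ends."""
    overlap = 0
    for pi, pj, pl in pred:
        for gi, gj, gl in gold:
            if pl == gl:
                overlap += max(0, min(pj, gj) - max(pi, gi) + 1)
    p_total = sum(j - i + 1 for i, j, _ in pred)
    g_total = sum(j - i + 1 for i, j, _ in gold)
    prec = overlap / p_total if p_total else 0.0
    rec = overlap / g_total if g_total else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, predicting span (0, 3) against gold span (1, 3) with the same label gives precision 0.75 and recall 1.0.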

5. CONCLUSION & LIMITATION

In this paper, we propose a novel label extraction framework based on a simple inductive bias and model single-label and multi-label text classification in a unified way. We discuss how to build a probabilistic model that maximizes the probability of valid potential label trees by leveraging the internal representations of a structured language model as symbolic interfaces. Our experimental results show that our method achieves inherent interpretability at various granularities, and the generated label trees could be valuable in various unsupervised tasks requiring constituent-level outputs. Regarding the limitations of our work, we require that the labels of the texts in a dataset have a certain degree of diversity, which forces the model to learn self-consistent constituent-label alignments. For example, in ATIS, almost all training samples carry the same labels, such as "fromloc.city name" and "toloc.city name"; that is why our model fails to accurately associate these two labels with the correct spans in Figure 7.

6. REPRODUCIBILITY STATEMENT

In the supplemental material, we include a zip file containing our code and download links for the datasets. We also include the scripts used to run all baselines and the Symbolic-Neural models.

7. ACKNOWLEDGEMENT

This work was supported by Ant Group through CCF-Ant Research Fund. We thank the Aliyun EFLOPS team for their substantial support in designing and providing a cutting-edge training platform to facilitate fast experimentation in this work. We also thank Jing Zheng for his help in paper revising and code reviewing.

A APPENDIX

A.1 RELATED WORKS

Structured language models. Many attempts have been made to develop structured language models. Pollack (1990) proposed using an RvNN as a recursive architecture to encode text hierarchically, and Socher et al. (2013) showed the effectiveness of RvNNs with gold trees for sentiment analysis; however, both approaches require annotated trees. Gumbel-Tree-LSTMs (Choi et al., 2018) construct trees by recursively selecting two terminal nodes to merge and learning composition probabilities from downstream tasks. CRvNN (Chowdhury & Caragea, 2021) makes the entire process end-to-end differentiable and parallel by introducing a continuous relaxation. However, neither Gumbel-Tree-LSTMs nor CRvNN includes a pretraining mechanism. URNNG (Kim et al., 2019) proposed the first architecture to jointly pretrain a parser and an encoder based on RNNG (Dyer et al., 2016), but its O(n³) time and space complexity makes it hard to pretrain on large-scale corpora. ON-LSTM and StructFormer (Shen et al., 2019; 2021) propose a series of methods to integrate structure into LSTMs or Transformers by masking information in differentiable ways; as the encoding process is still performed in layer-stacking models, there are no intermediate representations for tree nodes. Maillard et al. (2017) propose an alternative approach based on a differentiable CKY encoding, made differentiable by a soft-gating approach that approximates discrete candidate selection with a probabilistic mixture of the constituents available in a given chart cell. While their work relies on annotated downstream tasks to learn structures, Drozdov et al. (2019) propose a novel auto-encoder-like pretraining objective based on the inside-outside algorithm (Baker, 1979; Casacuberta, 1994), but it is still of cubic complexity. To tackle the O(n³) limitation of CKY encoding, Hu et al. (2021) propose an MLM-like pretraining objective and a pruning strategy, which reduce the complexity of encoding to linear and make it possible to pretrain the model on large-scale corpora.

Multi-Instance Learning. Multi-instance learning (MIL) deals with problems where labels are associated with groups of instances, or bags (spans in our case), while instance labels are unobserved. We adopt a canonical multi-instance learning framework for text classification proposed by Angelidis & Lapata (2018), in which each instance has a representation and all instances are fused by attention. The original work produces a hidden vector h_i for each segment with GRU modules and computes the attention weight a_i as the normalized similarity of each h_i with a query vector h_a:

a_i = exp(h_i^T h_a) / Σ_i exp(h_i^T h_a),  p_i = softmax(W_cls h_i + b_cls),  p_d = Σ_i a_i p_i^(c),  c ∈ [1, C],

where C is the total number of classes, p_i is the individual segment label prediction, and p_d is the document-level prediction. They use the negative log-likelihood of the prediction as the objective function: L_cls = -Σ_d log p_d^(y_d). As an experimental baseline, we simply replace segment representations with span representations: we use the top-down representation e'_{i,j} as the tensor to be attended to and predict the label from e_{i,j}, where the spans range over D, the span set of a parsing tree. Please note the MIL model in our baselines is trained together with L_bilm and L_KL, so its final loss is L_cls + L_self.

A.2 ESTIMATING P_Ψ(t̃[Y(t̃)=M] | t) WITH EXPONENTIAL COMPLEXITY

Consider estimating P_Ψ(t̃_{i,j}[Y(t̃_{i,j})=M] | t_{i,j}) directly. Let M_l and M_r denote a pair of sets subject to M_l ∪ M_r = M, and let C(M) denote the set containing all valid (M_l, M_r) pairs. Figure 5(a) shows all potential combinations of M_l and M_r when |M| > 1: in that case, C(M) is the set of all pairs such that Y(t̃_{i,k}) ∪ Y(t̃_{k+1,j}) = M. If |M| = 1, the case is similar to the one described in Figure 4. If M = ∅, n_{i,j} can only be associated with φ_T or φ_NT, with M_l = ∅ and M_r = ∅.
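The attention-based MIL fusion used in our baseline (following Angelidis & Lapata) can be sketched framework-free; the function names and toy vector inputs here are ours, not the original implementation.

```python
import math

def mil_fuse(h, h_a, W, b):
    """Attention-based MIL fusion: per-instance class distributions
    p_i = softmax(W h_i + b) are mixed by weights a_i computed as the
    normalized similarity of each h_i with the query vector h_a."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    def softmax(xs):
        m = max(xs)
        es = [math.exp(x - m) for x in xs]
        s = sum(es)
        return [e / s for e in es]
    a = softmax([dot(hi, h_a) for hi in h])          # attention weights a_i
    p = [softmax([dot(w, hi) + bc for w, bc in zip(W, b)]) for hi in h]
    n_cls = len(b)
    # document-level prediction p_d: attention-weighted mix of the p_i
    return [sum(a[i] * p[i][c] for i in range(len(h))) for c in range(n_cls)]
```

The output is a valid distribution over classes, dominated by the instances most similar to the query.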

A.7 MULTI-LABEL MULTI-INSTANCE LEARNING BASED ON FAST-R2D2

To support multi-label multi-instance learning, we refactor the above equations to support attention over different labels: for each label there is a label-specific query vector h. The final objective function is L = -Σ_{c∈T} log p^(c) - Σ_{c∈F\T} log(1 - p^(c)). In the semi-supervised slot-filling and NER tasks, we let the model predict labels first and then pick the span with the maximum attention weight for each label.
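A sketch of this MIMLL objective with per-label attention; the data layout (`p_span` as per-span label probabilities, `scores` as per-label attention logits) is our assumption, not the paper's code.

```python
import math

def mimll_loss(p_span, scores, gold, all_labels):
    """L = -sum_{c in T} log p^(c) - sum_{c in F\\T} log(1 - p^(c)),
    where p^(c) mixes span predictions with label-specific attention.
    p_span[i][c]: span i's probability for label c;
    scores[c][i]: attention logit of span i under label c's query."""
    def softmax(xs):
        m = max(xs)
        es = [math.exp(x - m) for x in xs]
        s = sum(es)
        return [e / s for e in es]
    loss = 0.0
    for c in all_labels:
        a = softmax(scores[c])                       # per-label attention
        p_c = sum(a[i] * p_span[i][c] for i in range(len(p_span)))
        loss += -math.log(p_c) if c in gold else -math.log(1 - p_c)
    return loss
```

At inference time, the span with the maximum attention weight for a predicted label is taken as that label's supporting span, as in the text above.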



For example, in aspect-based sentiment analysis, a span corresponding to the sentiment may be nested in a span corresponding to the aspect. Details of the dynamic programming algorithm with exponential complexity that estimates $P_\Psi(t[\mathcal{Y}(t)=T] \mid t)$ are included in Appendix A.2. Details about the conditional independence assumption can be found in Appendix A.8. Our model can also be tuned based on the public version of pretrained Fast-R2D2, which is available at https://github.com/alipay/StructuredLM_RTDT/releases/tag/fast-R2D2. Details for pretraining the BERT and Fast-R2D2 models used in this paper from scratch can be found in Appendix A.3.

5: https://huggingface.co/blog/how-to-train



Figure 1: Our model can learn to predict span-level labels without access to span-level gold labels during training. In examples (a) and (b), only raw texts and sentence-level gold labels {request address, navigate} and {negative} are given.

Figure 2: [PRT], [LEFT], [RIGHT] are role embeddings for the corresponding inputs.

Figure 3: (a) A parse tree. (b) A label tree converted from the parse tree on the left. For the given label tree, Y returns {A, C}. Y stops traversing at the terminal nodes in light gray, whose ancestor labels are all ϕ_NT; the nodes in dark gray are not visited.

Figure 5: (a) Potential valid yield results of left and right children for M = {a, b, c}. (b) Valid label trees when we include a mutual-exclusiveness constraint.

Figure 8: A sample of the Symbolic-Neural model on Navigator with the top-down encoder.

Figure 9: Samples of the Symbolic-Neural model with the top-down encoder.




Published as a conference paper at ICLR 2023

A.11 REAL LABEL TREES SAMPLED FROM SYMBOLIC-NEURAL

-t/-e SAMPLED LABEL TREES IN ATIS

We sample label trees from Symbolic-Neural -t/-e and Symbolic-Neural +t/-e respectively for observation. Ground truths are annotated in brackets.

Figure 10: The label tree generated by Symbolic-Neural without the top-down encoder.

Figure 11: The label tree generated by Symbolic-Neural with the top-down encoder.

BERT follows the setting in Devlin et al. (2019), using 12-layer Transformers with 768-dimensional embeddings, 3,072-dimensional hidden-layer representations, and 12 attention heads. The setting of Fast-R2D2 follows Hu et al. (2022). Specifically, the tree encoder uses 4-layer Transformers with the other hyper-parameters the same as BERT, and the top-down encoder uses 2-layer ones. The top-down parser uses a 4-layer bidirectional LSTM with 128-dimensional embeddings and 256-dimensional hidden layers. We train all the systems across the seven datasets for 20 epochs with a learning rate of 5 × 10⁻⁵ for the encoder, 1 × 10⁻² for the unsupervised parser, and a batch size of 64 on 8 A100 GPUs. We report mean accuracy for SST-2, Matthews correlation for CoLA, and F1 scores for the rest. We use "S.N." to denote the systems based on the Symbolic-Neural architecture and "Sent." to denote those using only whole-sentence representations. We use the subscript fp for the models based on full permutation, topdown for those with the top-down encoder, and exclusive for those with both the top-down encoder and the mutual-exclusiveness constraint. Please find the details of S.N.fp in Appendix A.2.

F1 scores for semi-supervised slot filling and NER, where the gold span positions are hidden. "Thres." is short for "threshold".


Finally, the transition function for $t_{i,j}$ where $i < j$ is:
$$X^M_{i,j} = \begin{cases} P(\phi_{NT}\mid n_{i,j}) \sum_{(M_l,M_r)\in C(M)} X^{M_l}_{i,k} X^{M_r}_{k+1,j}, & |M| > 1 \\ P(m\mid n_{i,j}) + P(\phi_{NT}\mid n_{i,j})\left(X^{\phi}_{i,k} X^{M}_{k+1,j} + X^{M}_{i,k} X^{\phi}_{k+1,j} + X^{M}_{i,k} X^{M}_{k+1,j}\right), & M = \{m\} \\ P(\phi_T\mid n_{i,j}) + P(\phi_{NT}\mid n_{i,j})\, X^{\phi}_{i,k} X^{\phi}_{k+1,j}, & M = \phi \end{cases} \quad (8)$$
where $k$ is the split point of $t_{i,j}$ given by the parse tree. When $i = j$, we have:
$$X^M_{i,j} = \begin{cases} P(m\mid n_{i,j}), & M = \{m\} \\ P(\phi_T\mid n_{i,j}) + P(\phi_{NT}\mid n_{i,j}), & M = \phi \end{cases} \quad (9)$$
The transition function works in a bottom-up manner and iterates over all possible $M \subseteq T$; $X^T_{1,|S|}$ is the final probability. However, iterating over $C(M)$ and all $M \subseteq T$ has exponential complexity, so the algorithm only works when $|T|$ is small.
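The bottom-up recursion over a fixed parse tree can be sketched as follows. This is our own illustrative implementation, not the paper's code: trees are nested tuples with integer leaves, and the node-label distribution $P(\cdot \mid n)$ is encoded as a dict with keys for each label plus `'phi_T'` and `'phi_NT'`, all of which are assumptions.

```python
from itertools import chain, combinations

def subsets(labels):
    """All subsets of `labels`, as frozensets (2^|labels| of them)."""
    return [frozenset(s) for s in chain.from_iterable(
        combinations(labels, r) for r in range(len(labels) + 1))]

def splits(M):
    """The set C(M): all pairs (M_l, M_r) with M_l ∪ M_r = M."""
    return [(Ml, Mr) for Ml in subsets(M) for Mr in subsets(M)
            if Ml | Mr == M]

def yield_prob(tree, probs, labels):
    """Compute X^M for every M ⊆ labels at `tree`, following Eqs. (8)-(9).

    tree:  a leaf index, or a pair (left_subtree, right_subtree);
           the split point of each span is fixed by the tree.
    probs: maps each node to its label distribution P(·|n).
    """
    P = probs[tree]
    X = {}
    if not isinstance(tree, tuple):               # terminal node, Eq. (9)
        for M in subsets(labels):
            if len(M) == 1:
                (m,) = M
                X[M] = P[m]
            elif len(M) == 0:
                X[M] = P['phi_T'] + P['phi_NT']
            else:                                 # a terminal cannot yield >1 label
                X[M] = 0.0
        return X
    Xl = yield_prob(tree[0], probs, labels)       # children given by the tree
    Xr = yield_prob(tree[1], probs, labels)
    for M in subsets(labels):                     # internal node, Eq. (8)
        if len(M) == 1:
            (m,) = M
            X[M] = P[m] + P['phi_NT'] * (
                Xl[frozenset()] * Xr[M]
                + Xl[M] * Xr[frozenset()]
                + Xl[M] * Xr[M])
        elif len(M) == 0:
            X[M] = P['phi_T'] + P['phi_NT'] * Xl[M] * Xr[M]
        else:
            X[M] = P['phi_NT'] * sum(
                Xl[Ml] * Xr[Mr] for Ml, Mr in splits(M))
    return X
```

One sanity check on the recursion: since each pair $(M_l, M_r)$ contributes to exactly one $M = M_l \cup M_r$, the values $X^M$ sum to 1 over all $M$ whenever each node's $P(\cdot \mid n)$ is normalized. The exponential cost shows up directly in `subsets` and `splits`, matching the remark above that the algorithm is only practical for small $|T|$.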

A.3 PRETRAIN BERT AND FAST-R2D2 FROM SCRATCH

The dataset WikiBooks originally used to train BERT (Devlin et al., 2019) is a combination of English Wikipedia and BooksCorpus (Zhu et al., 2015). However, BooksCorpus is no longer publicly available, so it is hard to pretrain Fast-R2D2 on the same corpus, making a fair comparison with the publicly released BERT model impossible. Considering our limited GPU resources, we pretrain both BERT and Fast-R2D2 from scratch on WikiText-103. We train BERT from scratch following the tutorial by Huggingface 5 with the mask rate set to 15%. The vocabulary of BERT and Fast-R2D2 is kept the same as that of the original BERT. As demonstrated in RoBERTa (Liu et al., 2019), the NSP task is harmful and longer sequences help improve performance in downstream tasks, so we remove the NSP task and use the original corpus, not split into sentences, as input. For Fast-R2D2, WikiText-103 is split at the sentence level, and sentences longer than 200 tokens after tokenization are discarded (about 0.04‰ of the original data). BERT is pretrained for 60 epochs with a learning rate of 5 × 10⁻⁵ and a batch size of 50 per GPU on 8 A100 GPUs. Fast-R2D2 is pretrained with a learning rate of 5 × 10⁻⁵ for the transformer encoder and 1 × 10⁻³ for the parser. Please note that the batch size of Fast-R2D2 is adjusted dynamically to ensure that the total length of the sentences in a batch does not exceed a maximum threshold; to make the batch size similar to that of BERT, the threshold is set to 1536. Because the average sentence length in WikiText-103 is around 30, the average batch size of Fast-R2D2 is around 50, similar to that of BERT.

A.8 ABOUT THE CONDITIONAL INDEPENDENCE ASSUMPTION

We argue that the independence assumption used in our objective is actually weaker than the one used in conventional multi-label classification tasks. Formally, conventional multi-label classification is the problem of finding a model that maps inputs x to binary vectors y, that is, one that assigns a value of 0 or 1 to each element (label) of y.
So the objective of multi-label classification is to minimize $-\log P(\{y_i = 1\}_{i\in T}, \{y_j = 0\}_{j\in O} \mid x)$, where $T$ denotes the indices of the gold labels and $O$ denotes the indices not in $T$. It is impossible to estimate this tractably without introducing some conditional independence assumption. By assuming the states of the labels are independent of each other, we have:
$$-\log P(\{y_i = 1\}_{i\in T}, \{y_j = 0\}_{j\in O} \mid x) = -\sum_{i\in T} \log P(y_i = 1 \mid x) - \sum_{j\in O} \log P(y_j = 0 \mid x).$$
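Once factorized, the objective is just a per-label binary cross-entropy. A tiny worked example (function name and probabilities are illustrative):

```python
import math

def factorized_nll(p, gold):
    """-Σ_{i∈T} log P(y_i=1|x) - Σ_{j∈O} log P(y_j=0|x):
    the multi-label objective after the independence assumption.

    p:    list of per-label probabilities P(y_i = 1 | x).
    gold: set of gold label indices T; all other indices form O.
    """
    return -sum(math.log(p[i]) if i in gold else math.log(1.0 - p[i])
                for i in range(len(p)))

# Three labels with gold set T = {0, 2}:
# loss = -(ln 0.9 + ln(1 - 0.2) + ln 0.8) ≈ 0.5516
loss = factorized_nll([0.9, 0.2, 0.8], {0, 2})
```

Each label contributes an independent term, which is exactly what makes the factorized objective tractable to estimate.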

