NEURAL-BASED CLASSIFICATION RULE LEARNING FOR SEQUENTIAL DATA

Abstract

Discovering interpretable patterns for the classification of sequential data is of key importance for a variety of fields, ranging from genomics to fraud detection, and for interpretable decision-making in general. In this paper, we propose a novel, fully interpretable and differentiable method to discover both local and global patterns (i.e. capturing a relative or absolute temporal dependency) for rule-based binary classification. It consists of a convolutional binary neural network with an interpretable neural filter and a training strategy based on dynamically-enforced sparsity. We demonstrate the validity and usefulness of the approach on synthetic datasets and on an open-source peptides dataset. Key to this end-to-end differentiable method is that the expressive patterns used in the rules are learned alongside the rules themselves.

1. INTRODUCTION

Over the last decades, machine learning, and in particular neural networks, has made tremendous progress on classification tasks for a variety of fields such as healthcare, fraud detection or entertainment. These models can learn from various data types, ranging from images to time series, and achieve impressive classification accuracy. However, they are difficult or impossible for a human to understand. Recently, explaining these black-box models has attracted considerable research interest in the field of Explainable AI (XAI). However, as stated by Rudin (2019), such a posteriori approaches are not the solution for high-stakes decision-making, and more interest should be placed on learning models that are interpretable in the first place. Rule-based methods are interpretable and human-readable, and have been widely adopted in different industrial fields with Business Rule Management Systems (BRMS). In practice, however, those rules are written manually by experts. One of the reasons manually-written rule models cannot easily be replaced with learned rule models is that rule-learning models are unable to learn rules that are equally expressive, with higher-level concepts and a complex grammar (Kramer, 2020). Moreover, due to the lack of latent representations, rule-based learning methods underperform w.r.t. state-of-the-art neural networks (Beck & Fürnkranz, 2021). Classical classification rule learning algorithms (Cohen, 1995; Breiman et al., 1984; Dash et al., 2018; Lakkaraju et al., 2016; Su et al., 2016) as well as neural-based approaches to learn rules (Qiao et al., 2021; Kusters et al., 2022) (or logical expressions with Riegel et al. (2020)) do not provide the grammar required to learn classification rules on sequential data. Numerous approaches for learning classification rules on sequential data in the field of sequential pattern mining have been studied in the past, such as Egho et al. (2015); Zhou et al. (2013); Holat et al.
(2014), but with a different goal in mind: improving the performance of extracted patterns for a fixed rule grammar, as opposed to extending the rule grammar. Another domain of research focuses on training binary neural networks to obtain models that are more efficient in storage, computation and evaluation (Geiger & Team, 2020; Helwegen et al., 2019). This comes with fundamental optimization challenges around weight updates and gradient computation. In this paper, we bridge these three domains and introduce a binary neural network to learn classification rules on sequential data. We propose a differentiable rule-based classification model for sequential data where the conditions are composed of sequence-dependent patterns that are discovered alongside the classification task itself. More precisely, we aim at learning a rule of the following structure: if pattern then class = 1 else class = 0. In particular, we consider two types of patterns, local and global patterns, as introduced in Aggarwal (2002), which are in practice studied independently with a local and a global model. A local pattern describes a subsequence at a specific position in the sequence, while a global pattern is invariant to the location in the sequence (Fig 2). The network, which we refer to as Convolutional Rule Neural Network (CR2N), builds on top of a base rule model that is comparable to rule models for tabular data presented in Qiao et al. (2021); Kusters et al. (2022). The contributions of this paper are the following: i) We propose a convolutional binary neural network that learns classification rules together with the sequence-dependent patterns in use. ii) We present a training strategy to train a binarized neural network while dynamically enforcing sparsity. iii) We show on synthetic and real-world datasets the usefulness of our architecture, with the importance of the rule grammar, and the validity of our training process, with the importance of sparsity.
The code is publicly available at https://github.com/IBM/cr2n.

2. BASE RULE MODEL

The base rule model we invoke is composed of three consecutive layers (Fig 1). The last two layers respectively mimic logical AND and OR operators (Qiao et al., 2021; Kusters et al., 2022). On top of these layers, we add an additional layer that is specific to categorical input data and corresponds to an OR operator, for each categorical variable, over every possible value it can take. The AND layer takes binary features (which are atomic boolean formulae) as input and outputs to the OR layer. The output of the OR layer is mapped to the classification label y. These layers have binary weights specifying the nodes that are included in the respective boolean expression (conjunction or disjunction). In other words, this network implements the evaluation of a DNF and has a direct equivalence with a binary classification rule such as if (A ∧ B) ∨ C then class = 1 else class = 0, where A, B and C are binary input features (atoms in logical terms). In this paper, we focus on supervised binary classification, where we predict the label y ∈ {0, 1} given input data x. The base rule model is illustrated in Fig 1 and is composed of three binary neural layers: • Input neurons x: binarized input features of size K (x_c are one-hot encoded categorical input features of size n). • Hidden neurons h: conjunctions of the input features, of size H. • Output neuron y: a disjunction of the (hidden) conjunctions. We assign to each of the boolean operations, i.e.
AND and OR operations, a binary weight (W_and and W_or respectively) that plays the role of a mask filtering nodes with regards to their respective logical operation. For the sake of simplicity, we did not extend the model with a logical NOT operation. The disjunction operation is implemented as y = min(W_or h, 1). (1) If none of the neurons h are activated then y = 0, and y = 1 if at least one is. For the conjunction operation, we use De Morgan's law, which expresses the conjunction with the OR operator: A ∧ B = ¬(¬A ∨ ¬B). Combined with Eq 1, we obtain: h = ¬(min(W_and(¬x), 1)) = 1 − min(W_and(1 − x), 1). (2) StackedOR Input Layer As defined previously, the AND layer takes binary input features as input. In this paper, we propose to add an additional layer for categorical data. A categorical variable x_c can take one value α_i^c out of a fixed number of possible values n, e.g. {α_0, . . . , α_3} = {A, B, C, D}. Without any additional layer, it requires a one-hot encoding to be provided as input to the AND layer. Binary inputs x_c = A and x_c = B are then given as input to the AND layer, which can in theory represent the impossible expression x_c = A ∧ x_c = B, i.e. the model has to learn the hidden categorical relationship between the one-hot encoded variables. To prevent learning a distribution we already know, we deepen the model with the addition of a stacked architecture of OR layers as input to the AND layer, as shown in Fig 1. This structure is defined by K weights, W_stack^k, one for each input categorical variable, and will be referred to as the StackedOR layer with weights W_stack. To conclude, the base rule model is composed of a StackedOR layer for categorical variables, a logical AND layer and an OR layer.
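As an illustration, the evaluation of Eq 1 and 2 can be sketched in a few lines of numpy. The weight matrices below are toy values (not learned) encoding the rule if (x0 ∧ x1) ∨ x2 then 1 else 0; function names are ours, not from the paper's code.

```python
import numpy as np

def or_layer(x, W):
    # Disjunction (Eq 1): output j is 1 iff at least one selected input is 1.
    return np.minimum(W @ x, 1)

def and_layer(x, W):
    # Conjunction via De Morgan (Eq 2): h = 1 - min(W(1 - x), 1).
    return 1 - np.minimum(W @ (1 - x), 1)

# Toy binary weights encoding: if (x0 AND x1) OR x2 then 1 else 0
W_and = np.array([[1, 1, 0],   # h0 = x0 AND x1
                  [0, 0, 1]])  # h1 = x2
W_or = np.array([[1, 1]])      # y = h0 OR h1

def base_rule_model(x):
    return int(or_layer(and_layer(x, W_and), W_or)[0])

print(base_rule_model(np.array([1, 1, 0])))  # 1
print(base_rule_model(np.array([0, 0, 0])))  # 0
```

With binary weights, the min-based formulation evaluates exactly like the corresponding boolean formula while remaining a composition of differentiable operations when weights are relaxed.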
The formal grammar that this architecture can express is specified with the following production rules (see Appendix A for the full grammar). This grammar is also limited by the model architecture: a conjunction contains at most one occurrence of each predicate, and the total number of conjunctions is bounded by the number of hidden nodes.

3. CONVOLUTIONAL RULE NEURAL NETWORK

Our main contribution is to extend the base rule model to sequential data. We apply the base rule model as a 1D-convolutional window of fixed length l ∈ N over a sequence and retrieve all outputs as input for an additional disjunctive layer, which we refer to as the ConvOR layer, as shown in Fig 2. The base rule model learns a DNF over the window size length, and the ConvOR layer indicates where along the sequence that logical expression is true. If the evaluation of the logical expression is true all along the sequence, then it can be described as a global pattern; otherwise, the learned pattern represents a local pattern. The model input is now of size Σ_k l × n_k and the output of the StackedOR layer (or input of the AND layer) is l × K. Other dimensions are not impacted. For simplicity in the following, K is fixed to 1, i.e. input data is composed of one categorical variable evolving sequentially. The method is still valid for K > 1.

Figure 2: Example of a trained CR2N architecture. The base rule model is applied as a 1D-convolutional window over the sequence (i.e. a sliding window). The resulting boolean values are given as input to the ConvOR layer, which indicates through its activated weights where along the sequence the expression learned by the base model is true. The output of the ConvOR layer is mapped to the label of the sequence y. For local patterns, the base model expression needs to be shifted accordingly to the ConvOR layer weights. (Filter examples from the figure: local pattern "(B at t-2 and D at t-1) or (D at t-1 and C at t-0)", its shifted version "(B at t-5 and D at t-4) or (D at t-4 and C at t-3)", and global pattern "B-D or D-C in sequence".) For a real-domain application like fraud detection, by providing meaning to B, C and D, we could have for example: if "receiving a transaction of amount X" (B) is followed by "emitting a transaction of amount X" (D), or "emitting a transaction of amount X" (D) is followed by "closing the bank account" (C), then class = fraud.
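A minimal sketch of this sliding-window evaluation (boundary padding omitted for brevity): the base model below is a hand-written stand-in for a trained filter detecting "B followed by D", and with all ConvOR weights active the network fires when the pattern occurs anywhere in the sequence. All names and the toy vocabulary are illustrative assumptions, not the paper's code.

```python
import numpy as np

VOCAB = {c: i for i, c in enumerate("ABCD")}

def one_hot(seq):
    # One-hot encode a sequence of letters over the toy vocabulary A..D.
    x = np.zeros((len(seq), len(VOCAB)), dtype=int)
    for t, c in enumerate(seq):
        x[t, VOCAB[c]] = 1
    return x

def base_model(window):
    # Stand-in for a trained base rule model of window size l = 2:
    # true iff the window reads "B then D".
    return int(window[0, VOCAB["B"]] == 1 and window[1, VOCAB["D"]] == 1)

def cr2n(seq, w_conv, l=2):
    # Slide the base model over the sequence (stride 1), then take a
    # disjunction of the window outputs selected by the ConvOR weights.
    x = one_hot(seq)
    outs = np.array([base_model(x[i:i + l]) for i in range(len(seq) - l + 1)])
    return int(min(int(w_conv[:len(outs)] @ outs), 1))

w_global = np.ones(16, dtype=int)  # all ConvOR weights active: global pattern
print(cr2n("ABDC", w_global))  # 1: "B-D" occurs somewhere in the sequence
print(cr2n("ACDB", w_global))  # 0
```

Masking some entries of `w_conv` instead of keeping them all at 1 restricts the disjunction to specific window positions, which is exactly the local-pattern case.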
With this approach, different sequence-dependent expressions can be extracted, and their nature depends on the weights of the ConvOR layer (Fig 2). The obtained weights thus translate to a rule grammar with the following production rules: rule → if expression then class = 1 else class = 0; expression → local pattern | global pattern. (4) We introduce t, the position when the last observation in a sequence was made. With t being our reference, in a sequence of size N ∈ N, t−i refers to the moment of the i-th observation before t (0 ≤ i ≤ N−1). A, B, C and D are toy possible values for our categorical variable x_c (they cannot be activated simultaneously at the same position t in the sequence). With those definitions, we list below examples of the different sequence-dependent expressions that can be expressed with the proposed architecture (see Fig 2). A local pattern is an expression composed of predicates that are true at a specific position i, for example A at t-15. Based on Eq 3 we have: local pattern → base expression; predicate → categorical expression at t-i | literal at t-i. A global pattern is an expression describing the presence of a pattern anywhere in the sequence; for example, B-D in sequence is a global pattern, where the "-" sign refers to "followed by" and "*" corresponds to any unique literal (equivalent to: there exists i ∈ [0; N−1] such that B at t-i-1 and D at t-i).

4. TRAINING STRATEGY

To overcome training challenges attributed to binarized neural networks (Geiger & Team, 2020), we use latent weights and enforce sparsity dynamically. We define a loss function that penalizes complex rules, and the model is trained via automatic differentiation (PyTorch) with the Adam optimizer. Latent weights The binary model parameters introduced above (W_and, W_or, W_stack, W_conv) are trained indirectly via the training of a continuous parameter loc, which is activated (binarized) by a sigmoid function (Kusters et al., 2022). With such binary weights and continuous relaxation, Eq 1 and 2 are differentiable with nonzero derivatives (Kusters et al., 2022). As opposed to using a straight-through estimator (Qiao et al., 2021), non-zero gradients are ensured during the backward pass. To overcome training limitations, we use a hard concrete distribution (Qiao et al., 2021; Louizos et al., 2018): it rescales the weights, and the random variable introduced during training helps avoid local minima (Appendix B). Weight values are in [0, 1] during training, while for testing and rule extraction, a Heaviside step function is applied to them (threshold 0.5) to ensure strict binarization.
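The latent-weight mechanism can be sketched as follows; this is a simplified numpy illustration without the hard concrete rescaling, and variable names are ours.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

loc = np.array([2.0, -1.0, 0.1])        # latent continuous parameters (toy values)

w_soft = sigmoid(loc)                    # relaxed weights in (0, 1) used in training
grad = w_soft * (1.0 - w_soft)           # d sigmoid / d loc: strictly positive,
                                         # so gradients never vanish to exactly 0
w_hard = (w_soft >= 0.5).astype(float)   # Heaviside step for testing / extraction

print(w_hard)  # [1. 0. 1.]
```

The soft weights keep the forward pass differentiable end to end, while the thresholded weights are the ones read off when extracting the final rule.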

Loss function

We define the loss function L, composed of a mean-squared error component along with a regularization term that penalizes the complexity of the rule: L = L_mse + λΠ. (7) The regularization term Π, or penalty, evaluates the number of terminal conditions in the rule. In practice we use λ = 10^−5. For a layer n of input size I and output size O, the number of terminal conditions per output corresponds to the weighted sum, over the I inputs, of the numbers of terminal conditions of each output of the previous layer, i.e. Π_layer n = W_layer n Π_layer n−1, a vector of size O. For the first layer, the StackedOR layer, Π_stack is defined as the sum over the input dimension of the weights, and we can then express the number of terminal conditions of the base rule model: Π_layer 0 = Π_stack = [Σ W_stack^1, …, Σ W_stack^K]^T, Π_base = W_or W_and Π_stack. To optimize Π for local patterns, we have to minimize the number of activated ConvOR layer weights; for global patterns, we want them all to be activated. A condition could be set on the sum of the ConvOR layer weights (Eq 9) to shift from one optimization problem to the other, but with a loss of continuity and thus of differentiability. Interesting values of τ are M + l − 1, the ConvOR layer input size, which corresponds to all ConvOR layer weights being equal to 1, and M − l + 1, which allows 2(l − 1) weights to be 0 and corresponds to the padding required for properly accounting for sequence boundaries (Section 3): Π_local = Π_base Σ W_conv, Π_global = Π_base, Π* = Π_global if Σ W_conv ≥ τ, else Π_local. (9)

Table 1: Ground truth applied on sequences of letters (A to F) to generate synthetic unbalanced datasets 1, 2, 3 and 4, along with the distributions of the positive class. In the patterns, t refers to the position when the last observation in a sequence was made. Balanced datasets with the same ground truth are generated and are referred to as the dataset number followed by the letter b (Appendix D).
# | Ground truth | Distribution (%)
1 | C at t-4 | 14.2
2 | A at t-6 and C at t-4 | 1.5
3 | (A at t-6 and C at t-4) or (B at t-5 and C at t-3) | 3.6
4 | B-D in sequence | 20.4

Due to the non-continuity of Π* in Eq 9, we choose to have two models with the same architecture for the two cases: the local and the global model, respectively more relevant for their associated pattern. For the local model, all weights are trainable and Π = Π_local. For the global model, the weights in the ConvOR layer are fixed and set to 1, and Π = Π_global. Enforced sparsity Sparsity of the model is crucial to learn concise expressions: the model needs to generalize without observing all possible instances at training time. The first requirement for that matter is sparsity in the base rule model. In addition to the regularization term in the loss function, we propose to use a sparsify-during-training method (Hoefler et al., 2021) and dynamically enforce sparsity in the weights from 0% to an end rate r_f, set to 99% in our case (Lin et al., 2020). Sparsify-during-training methods can also benefit the quality of the training in terms of convergence, by correcting approximation errors due to premature pruning in early iterations, but are highly dependent on the sparsification schedule (Hoefler et al., 2021). Every 16 iterations, at iteration s out of a total of s_f training iterations, every trainable weight is pruned with a binary mask m (of the size of its associated weight, applied with the Hadamard product ⊙) (Lin et al., 2020; Zhu & Gupta, 2017). We propose a mask based on the maximum of the weight magnitude loc and a pruning rate r (Zhu & Gupta, 2017), making the assumption that this contributes to generalization (Eq 10). This strategy can be more aggressive than state-of-the-art contributions (Lin et al., 2020) due to its dependency on the loc maximum value. During training, the model with the highest prediction accuracy on the validation dataset and the highest sparsity (evaluated at each epoch) is kept.
r = r_f − r_f (1 − s/s_f)^3,   m_i,j = 1[|loc_i,j| ≥ r · max_i,j(loc)],   Ŵ = W ⊙ m (10)

Additional training optimizations have been tested, such as using a binarized optimizer (Helwegen et al., 2019; Geiger & Team, 2020), adding a scheduled cooling on the sigmoid of the binarized weights, alternating the training of each layer every few epochs (Qiao et al., 2021), or using a learning rate scheduler. Those techniques are not presented here but would be of interest for improving results on specific datasets.
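The schedule and mask of Eq 10 can be sketched as follows; this is a simplified illustration with toy loc values, where we read "maximum of weight magnitude" as the maximum of |loc|.

```python
import numpy as np

def prune_mask(loc, s, s_f, r_f=0.99):
    # Cubic sparsification schedule: the pruning rate r grows from 0
    # at s = 0 to r_f at s = s_f.
    r = r_f - r_f * (1.0 - s / s_f) ** 3
    # Keep only entries whose magnitude reaches r times the maximum
    # magnitude; the pruned weight is then W ⊙ m.
    return (np.abs(loc) >= r * np.abs(loc).max()).astype(float)

loc = np.array([[0.9, 0.05], [-0.4, 0.6]])
print(prune_mask(loc, s=0, s_f=1000))     # r = 0: nothing is pruned yet
print(prune_mask(loc, s=1000, s_f=1000))  # r = 0.99: only near-max entries survive
```

Because the threshold is relative to the current maximum magnitude, a few dominant entries can rapidly force almost all other weights to zero, which is why the strategy is described as aggressive.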

5. EXPERIMENTS

In order to evaluate the validity and usefulness of this method, we apply it to both synthetic datasets and the UCI membranolytic anticancer peptides dataset (Grisoni et al., 2019; Dua & Graff, 2017).

Synthetic Datasets

We propose 8 synthetic datasets based on 4 ground-truth expressions, in both balanced and unbalanced distributions, for discovering simple binary classification rules with local or global patterns, as shown in Table 1. Each contains 1000 sequences of letters (A to F) of varying length, from 4 to 14 letters (mean around 9 ± 3). Generation is detailed in Appendix D.

Peptides Dataset Besides the synthetic datasets, the real-world UCI anticancer peptides dataset, composed of labeled one-letter amino acid sequences, is used (Grisoni et al., 2019; Dua & Graff, 2017).

We run the experiments with two different window sizes (3 and 6) for the CR2N convolution filter size. We compare the two versions of the architecture, the local and global models described in Section 4, and study three different dynamic pruning strategies: none, dynamically enforced sparsity from epoch 0, and from epoch 30 (arbitrary). Where both models reach comparably high accuracies, it is at the cost of rule complexity for the local approach, with average penalty values higher than 60 (and standard deviation higher than 50), compared to lower than 10 for the global model (and standard deviation lower than 5). This points out that the local model in that case requires on average more than 6 times more terminal conditions in the learned rule than the global model for comparable accuracies, but also that the weights' initial states have a huge impact on the rule complexity when the rule grammar is not expressive enough (with no pruning). Those results are confirmed on the real-world peptides dataset: accuracies between the local and global models, especially for a window size of 6, are comparable. However, there is an order of magnitude difference for the penalty, the global approach being more concise. It is important to note that, by architecture, the global approach has fewer weights to train and thus a much lower maximum penalty.

6. RESULTS

Datasets 2b and 3b benefit from a bigger window size (highly expected for datasets 3/3b due to the ground-truth pattern size), as shown in Fig 3(c). Accuracies are also higher with window size 6 than 3 for the peptides dataset, at the cost of higher penalties (Table 2). The more expressive the model is, i.e. the more patterns it can model, training limitations aside, the better for performance. Of course, any black-box neural network without such a 1-to-1 rule-mapping-constrained architecture would reach 100% accuracy, but it is that mapping in particular that makes the model relevant, expressive and fully interpretable. Also, the best performances in accuracy for the peptides dataset (∼91%) are comparable to the best results (∼92%) obtained from classification with single kernels applied to that same dataset in Nwegbu et al. (2022), our model providing an additional fully-interpretable property. The presented model is also flexible due to its logical equivalence and can be fed into other logical layers for deeper architectures to extend the rule grammar (Beck & Fürnkranz, 2021). It can also be extended to time series, temporal aggregates or multi-class classification problems. Other rule-grammar extensions can be inspired by the Linear Temporal Logic domain and regular expression pattern mining (De Giacomo et al., 2022). However, the more expressive the model is, the more attention is required for training and rule complexity. Then, regarding the performance in terms of accuracy, we can differentiate two cases: balanced and unbalanced datasets. Training on unbalanced datasets is more affected by the aggressive dynamic pruning strategy than on balanced datasets, with a drop in average of around 0.2 in accuracy for dataset 1 compared to an equivalent accuracy for its balanced version, for example (Fig 3(e)). The pruning strategy starting after 30 epochs is preferred in both cases.
Average accuracies with a pruning strategy not starting immediately (30 epochs) are comparable to the ones obtained without pruning for balanced datasets. In terms of rule complexity, penalty values are lower with pruning, and even lower when starting after 30 epochs, in most cases (Fig 3(f)). With our pruning strategy (Eq 10), we make the assumption that lower positive loc values are associated with overfitting or redundancy, taking into account that values closer to 0, i.e. on the sigmoid slope, are more likely to shift and are thus less 'certain'. As pointed out in early work by Prechelt (1997), a dynamic pruning strategy helps to overcome the possibly lower generalization ability of a fixed pruning, which could explain cases of better performance (peptides dataset, local model, window size of 3, for example). Prechelt proposed a different pruning strategy, based on a generalization loss, to characterize the amount of overfitting. While that strategy is relevant in more general cases and can be applied to many different networks, our strategy is tailored for minimizing positive trainable parameter values. Sparsity of the model is also induced via the regularization term Π in the loss function L (Eq 7). While this method is parameterized with a relative importance of sparsity for training optimization and provides an uncontrolled target sparsity, a dynamic pruning strategy is easier to control for both target sparsity and accuracy, but is highly dependent on the pruning schedule (Hoefler et al., 2021). An interesting point is made by Hoefler et al. (2021) about the convolutional operator, which 'can be seen as a sparse version of fully-connected layers'. That level of forced sparsity in our model is therefore defined by the fixed window size parameter with respect to the maximum sequence length.
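The recursive penalty Π of Section 4 can be sketched for toy weights as follows; shapes and values are illustrative assumptions (K = 2 categorical variables with 3 values each, H = 2 hidden conjunctions).

```python
import numpy as np

# Toy binary weights for the base rule model.
W_stack = [np.array([[1, 0, 1]]),   # variable 1: OR over two of its values
           np.array([[0, 1, 0]])]   # variable 2: OR over one value
W_and = np.array([[1, 1],           # h1 uses both stacked outputs
                  [1, 0]])          # h2 uses only variable 1
W_or = np.array([[1, 1]])           # y = h1 OR h2

# Penalty of the StackedOR layer: sum of weights per categorical variable.
pi_stack = np.array([w.sum() for w in W_stack])   # [2, 1]
# Each subsequent layer takes the weighted sum of the previous penalties.
pi_base = W_or @ (W_and @ pi_stack)
print(int(pi_base[0]))  # 5 = 3 (h1: 2 + 1 literals) + 2 (h2)
```

Being a composition of weighted sums of the binary weights, the penalty is differentiable in the relaxed weights and can sit directly in the loss.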
The ideal sparser window size would be the size of the maximum temporal hidden pattern in the distribution, which can only be approximated with external or expert knowledge and/or tuned by trial and error. With or without a dynamic pruning strategy, for highly unbalanced datasets (2 and 3), experiments have shown that the training strategy of the model is not suitable. Indeed, most models label everything with the majority class (50% balanced accuracy). This corresponds to the specific case of learning an empty rule (penalty = 0) (Fig 3(a, c, e)). For unbalanced datasets 1 and 4, the best models do not reach on average the same accuracies as on their balanced versions. Overall, this training strategy is both the key and the main limitation of our approach: it can provide a sharp, concise rule with minimal redundancy and a simplified logical expression, but it is highly dependent on numerous model, training and pruning parameters, and is not suited as-is for highly unbalanced datasets.

7. CONCLUSION

To conclude, we presented a 1D-convolutional neural architecture to discover local and global patterns in sequential data while learning binary classification rules. This architecture is fully differentiable and interpretable, and requires sparsity that is enforced dynamically. One main limitation is its dependence on the window size and sparsity scheduler parameters. Further work will consist in integrating this block into more complex architectures to augment the expressivity of the learned rules, as well as extending it to multi-class classification.

Sequence lengths range from 5 to 38 letters (mean: 17 ± 5.5). We transform this multi-classification problem into a binary classification problem (as done in Nwegbu et al. (2022)). Class 'inactive-virtual' is the positive class (750) and all the others are combined as the negative class (199). No other processing of the data is necessary and we leave it as-is.

D SYNTHETIC DATASETS GENERATION

Balanced datasets are generated randomly with the same ground truth as the unbalanced datasets. They are then upsampled until the minority class represents half of the target dataset size, and an appropriate number of majority-class samples is randomly removed.
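This balancing procedure can be sketched as follows; the function and its signature are ours, not the paper's code, and integer items stand in for sequences.

```python
import random

def balance(minority, majority, goal_size, seed=0):
    # Upsample the minority class (with repetition) to half the goal size,
    # then subsample the majority class down to the other half.
    rng = random.Random(seed)
    half = goal_size // 2
    up = minority * (half // len(minority)) + rng.sample(minority, half % len(minority))
    down = rng.sample(majority, half)
    return up, down

pos, neg = balance(list(range(30)), list(range(500)), goal_size=200)
print(len(pos), len(neg))  # 100 100
```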

E EXPERIMENTAL SETTING

The loc parameters for weight computation are initialized with the Xavier uniform initialization method (Glorot & Bengio, 2010). The loss function is described in Eq 7 and depends on the MSE loss and the regularization coefficient λ = 10^−5. The Adam optimizer is used with a fixed learning rate of 0.1, and a run consists of 200 epochs. Experiments were run on CPU on a MacBookPro18,2 (2021) with an Apple M1 Max chip, 10 cores, 32 GB of RAM, running macOS Monterey Version 12.4.

F FULL EXPERIMENT RESULTS

Table 3: Performance metrics obtained for the different models, window sizes and pruning strategies on the synthetic datasets. Values are followed by the standard deviation over the 10 executions with different weights initializations. (Bal Acc: balanced accuracy, Epoch: best epoch.)



A natural extension of the base rule architecture to sequential data would be an explicit recursion of the base rule model, similar to an RNN. This approach was tested but faced the same limitations as classical RNNs, i.e., vanishing gradients, and it only captures short-term dependencies.



Figure 1: Example of a trained base rule model architecture for the rule if (B_1 and D_2) or (D_2 and C_3) then 1 else 0 on 3 categorical variables x_c1, x_c2 and x_c3 (x_ck ∈ {A_k, B_k, C_k, D_k}). For simplicity, the truth value of x_c1 = B_1 is replaced by B_1, for example. Plain (dotted) lines represent activated (masked) weights. An example evaluation of the model is represented with the filled neurons (neuron = 1) for the binary input x_c1 = B_1, x_c2 = D_2 and x_c3 = A_3.

rule → if base expression then class = 1 else class = 0
base expression → conjunction | conjunction ∨ base expression
conjunction → predicate | predicate ∧ conjunction
predicate → categorical expression | literal
categorical expression → categorical literal | categorical literal ∨ categorical expression
categorical literal → x_c = α_0^c | . . . | x_c = α_nc^c (or simply α_0^c | . . . | α_nc^c)
literal → x_1 | x_2 | . . . (3)

Fig 1 is also still valid with a change of index: k now refers to the position in the window of size l.

Fig 2). If all the weights W_conv of the ConvOR layer are activated (i.e. equal to 1), the logical expression learned by the base model is true in all the sequence: a global pattern is learned. If only some of the weights of the ConvOR layer are activated, the logical expression learned by the base model is valid only in the windows associated with those weights: a local pattern is learned. The base model logical expression is modified accordingly to match that shift (see the example in Fig 2 with a shift of 3 sequential steps).

Figure 3: Representations of key results obtained on the synthetic datasets. Error bars represent the standard deviation over the 10 executions with different weights initializations. Full results are available in Appendix F.

This is also highlighted by comparing experiments with the local or global model and experiments with different window sizes. First of all, the accuracy of the local model is higher compared to the global model on balanced synthetic datasets 1, 2 and 3 (Fig 3(a)). For balanced and unbalanced dataset 4, both models achieve very high accuracies (> 95%), however at a cost in rule complexity for the local approach, as shown in Fig 3(b).

Sparsity and training strategy The importance of model sparsity is pointed out by the experiments with different pruning strategies. First, looking at training scenarios, both on the synthetic and peptides datasets, experiments with sparsify-during-training approaches reach the best model faster on average than without (lower best epoch, Fig 3(d)).

Additional special cases can be pointed out, such as the learning of a global pattern over an interval (e.g. B-*-D in window [t-6; t-3]) or the learning of sequence-characteristic-dependent expressions such as 4 ≤ len(sequence) ≤ 6, based on the sequence length (not shown in Fig 2; it corresponds to a specific case where the base model has learned an always-true rule). Also, it is important to note that base expression and conjunction in both grammars are bounded by the fixed window size l. To ensure full equivalence between the model and the rule, sequence boundaries need to be considered, especially for global patterns. All sequences are padded on both ends with a sequence of 0s of size l − 1 (not shown for simplicity in Fig 2). Sequences of different lengths are also supported, by creating a model based on the maximal available sequence length M in the data and padding shorter sequences with a sequence of 0s of appropriate length. The ConvOR layer input size is then M + l − 1. With this single architecture we can model both local and global patterns. However, for optimization reasons detailed below, we choose to differentiate the two into two distinct models: a local and a global model. The ConvOR layer weights for the global model are set and fixed to 1 during training.
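The padding arithmetic above can be checked with a short sketch (the helper name is ours):

```python
def conv_or_input_size(M, l):
    # Sequences of maximal length M are padded on both ends with l - 1
    # zeros, so the sliding window of size l yields M + l - 1 positions.
    padded_len = M + 2 * (l - 1)
    return padded_len - l + 1  # number of windows of size l, stride 1

print(conv_or_input_size(M=14, l=3))  # 16, i.e. M + l - 1
```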

Experimental Setting All datasets are partitioned in a stratified fashion, with 60% for training, 20% for validation and 20% for testing, and we use a batch size of 100 sequences. The hidden size in the base rule model is set to double the input size of the AND layer (which is the window size of the convolution). More details on the experimental setting can be found in Appendix E. At each epoch (200 in total), we evaluate the model against the validation dataset and keep the model with the highest accuracy, and in case of equality the model with the lowest penalty. For each experiment, we run the algorithm 10 times with different weights initializations. Resulting metrics are averaged over these runs.

Table 2: Performance metrics obtained for the different models, window sizes and pruning strategies on the peptides dataset, along with the standard deviations over the 10 executions with different weights initializations. (Bal. Acc.: balanced accuracy, Epoch: best epoch.)

Rule grammar and expressivity The importance of the rule model expressivity can be seen concretely by comparing the different patterns the local and global models have learned, for dataset 3b for example: 1.

ACKNOWLEDGMENTS

We would like to thank Shubham Gupta for helpful discussions and constructive feedback as well as Yusik Kim for reviewing the manuscript. This work has been partially funded by the French government as part of project PSPC AIDA 2019-PSPC-09, in the framework of the "Programme d'Investissement d'Avenir".

AUTHOR CONTRIBUTIONS

M.C. and R.K. designed the model. R.K. encouraged M.C. to investigate the use of convolutions and supervised the findings of this work. P.B. and F.F. helped supervise the project. M.C. validated the training strategy and carried out the implementation and the experiments. M.C. wrote the manuscript. All authors provided critical feedback and helped shape the manuscript.

A CONTEXT-FREE GRAMMAR

Context-free grammar (Chomsky, 1956) A context-free grammar is a 4-tuple G = (V_T, V_N, S, R) where: V_T is a finite set of terminals, the terminal elements of the language that form its alphabet; V_N, disjoint from V_T, is a finite set of non-terminal elements (variables) that define a sublanguage of L (we note V = V_N ∪ V_T, the vocabulary of the grammar); S ∈ V_N is the start symbol, the variable that defines the whole sentence; R is a finite set of rules, or production rules, of the form A → w with A ∈ V_N and w ∈ V*. In the following, we have 3 different types of terminal elements on the syntax level: reserved words, distinguished with the following style: reserved; signs, such as for example -, *, ...; and other terminal elements that are defined prior to the grammar, distinguished with the following style: terminal. The production rules presented in the paper (Eq 3, Eq 4, Eq 5 and Eq 6) define grammars when associated with values for V_T, V_N and S. Here is an example for the base rule model grammar with the production rules in Eq 3.

C PEPTIDES DATASET

The UCI anticancer peptides dataset (Grisoni et al., 2019) (available on Dua & Graff (2017)) is composed of one-letter amino acid sequences (of variable length), and each sequence is labeled with its anticancer activity on breast cancer cell lines. The dataset provides 4 classes with the following distribution: 83 inactive-exp, 750 inactive-virtual, 98 moderately active and 18 very active.

