INDUCING MEANINGFUL UNITS FROM CHARACTER SEQUENCES WITH DYNAMIC CAPACITY SLOT ATTENTION

Abstract

Characters do not convey meaning, but sequences of characters do. We propose an unsupervised distributional method to learn the abstract meaning-bearing units in a sequence of characters. Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the objects in the sequence, extending an architecture for object discovery in images. We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.

1. INTRODUCTION

When we look at a complex scene, we perceive its constituent objects, and their properties such as shape and material. Similarly, what we perceive when we read a piece of text builds on the word-like units it is composed of, namely morphemes, the smallest meaningful units in a language. This paper investigates deep learning models which discover such meaningful units from the distribution of character sequences in natural text. In recent years, there has been an emerging interest in unsupervised object discovery in vision (Eslami et al., 2016; Greff et al., 2019; Engelcke et al., 2020) . The goal is to segment the scene into its objects without supervision and ideally obtain an object-centric representation of the scene. These representations should lead to better generalization to unknown scenes, and additionally should facilitate abstract reasoning over the image. Locatello et al. (2020) proposed a relatively simple and generic algorithm for discovering objects called Slot Attention, which iteratively finds a set of feature vectors (i.e., slots) which can bind to any object in the image through a form of attention. Inspired by this line of work in vision, our goal is to learn a set of abstract continuous representations of the objects in text. We adapt the Slot Attention module (Locatello et al., 2020) for this purpose, extending it for discovering the meaningful units in natural language character sequences. This makes our work closely related to unsupervised morphology learning (Creutz, 2003; Narasimhan et al., 2015; Eskander et al., 2020) . However, there are fundamental differences between our work and morphology learning. First, we learn a set of vector representations of text which are not explicitly tied to the text segments. Second, our model learns its representations by considering the entire input sentence, rather than individual space-delimited words. 
These properties of our induced representations make our method more appropriate for inducing meaningful units as part of deep learning models. In particular, we integrate our unit discovery method on top of the encoder in a Transformer auto-encoder (Vaswani et al., 2017), as depicted in Figure 1, and train it with an unsupervised sentence reconstruction objective. This setting differs from previous work on Slot Attention, which has been tested on synthetic image data with a limited number of objects (Locatello et al., 2020). We propose several extensions to Slot Attention for the domain of real text data. We increase the capacity of the model to learn to distinguish a large number of textual units, and add the ability to learn how many units are needed to encode sequences with varying length and complexity. Thus, we refer to our method as Dynamic Capacity Slot Attention. Additionally, as a hand-coded alternative, we propose stride-based models and compare them empirically to our slot-attention based models.

To evaluate the induced representations, we both qualitatively inspect the model itself and quantitatively compare it to previously proposed representations. Visualisation of its attention patterns shows that the model has learned representations similar to the contiguous segmentations of traditional tokenization approaches. Trained probing classifiers show that the induced units capture similar abstractions to the units previously proposed in morphological annotations (MorphoLex (Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020)) and tokenization methods (Morfessor (Virpioja et al., 2013), BPE (Sennrich et al., 2016)). We propose to do this probing evaluation in both directions, to compare both informativeness and abstractness. These evaluations show promising results in the ability of our models to discover units which capture meaningful information at a higher level of abstraction than characters.

Figure 1: The sketch of our model. First, the Transformer encoder encodes the sequence and then Slot Attention computes the slot vectors (highlighted text). Next, the $L_0$Drop layer dynamically prunes out the unnecessary slots. Finally, the decoder reconstructs the original sequence.

To summarize, our contributions are as follows: (i) We propose a novel model for learning meaning-bearing units from a sequence of characters (Section 2). (ii) We propose simple stride-based models which could serve as strong baselines for evaluating such unsupervised models (Section 2.4). (iii) We analyze the induced units by visualizing the attention maps of the decoder over the slots and observe the desired sparse and contiguous patterns (Section 4.2). (iv) We show that the induced units capture meaningful information at an appropriate level of abstraction by probing their equivalence to previously proposed meaningful units (Section 4.4).

2.1. PROBLEM FORMULATION

Given a sequence of N characters $X = x_1 x_2 \ldots x_N$, we want to find a set of meaning-bearing units (slots) $M = \{m_1, \ldots, m_K\}$ which best represents X at a higher level of abstraction. As an example, consider the sequence "she played basketball", where we expect our slots to represent something like the set of morphemes of the sequence, namely {she, play, -ed, basket, -ball}.

2.2. OVERVIEW

We learn our representations through encoding the input sequence into slots and then reconstructing the original sequence from them. Particularly, we use an auto-encoder structure where slots act as the bottleneck between the encoder and the decoder. Figure 1 shows an overview of our proposed model, Dynamic Capacity Slot Attention. First, we encode the input character sequence by a Transformer encoder (Vaswani et al., 2017) , which gives us one vector per character. Then, we apply our highercapacity version of a Slot Attention module (Locatello et al., 2020) over the encoded sequence, to learn the slots. Intuitively, Slot Attention will learn a soft clustering over the input where each cluster (or respectively slot) corresponds to a candidate meaningful unit in the sequence. To select which candidates are needed to represent the input, we integrate an L 0 regularizing layer, i.e., L 0 Drop layer (Zhang et al., 2021) , on top of the slots. Although the maximum number of slots is fixed during the course of training, this layer ensures that the model only uses as many slots as necessary for the particular input. This stops the model from converging to trivial solutions for short inputs, such as passing every character through a separate slot. Finally, the Transformer decoder reconstructs the input sequence autoregressively using attention over the set of slots.

2.3. MODEL

Encoder. We use the Transformer encoder architecture (Vaswani et al., 2017) to encode our sequence and obtain the representation $X' = x'_1 x'_2 \ldots x'_N$ from our input sequence X.

Slot Attention for text. After encoding the character sequence, we use our extended version of Slot Attention for discovering meaningful units of the input character sequence. Slot Attention is a recent method for unsupervised object representation learning in vision (Locatello et al., 2020). It learns a set of feature vectors (slots) by using an iterative attention-based algorithm. Algorithm 1 shows the pseudocode of this method.

Algorithm 1: Slot Attention module (Locatello et al., 2020). q, k, v map the slots and inputs to a common dimension D, and T denotes the number of iterations.

    Require: inputs ∈ R^{N × D_input}, slots ∼ N(µ, diag(σ)) ∈ R^{K × D_slots}
    inputs = LayerNorm(inputs)
    for i = 1 to T do
        slots_prev = slots
        slots = LayerNorm(slots)
        attn = Softmax((1/√D) k(inputs) · q(slots)^T, axis = 'slots')
        updates = WeightedMean(weights = attn + δ, values = v(inputs))
        slots = GRU(states = slots_prev, input = updates)
        slots += MLP(LayerNorm(slots))
    end for
    return slots

Abstractly, in every iteration, it takes the following steps. First, it computes an attention map between the slots and the inputs, with the slots acting as queries and the inputs as keys. Then, it normalizes the attention map over slots, which makes the slots compete for representing each token of the input. Afterwards, it computes the slots' updates as the (normalized) weighted mean over the attention weights and the input values. Finally, it updates the slots through a Gated Recurrent Unit (GRU) (Cho et al., 2014) followed by a residual MLP. This process iterates a fixed number of times. In Locatello et al. (2020), the slots are initialized randomly by sampling from a Normal distribution with shared learnable parameters µ and σ, i.e.,

$\mathrm{slot}_i \sim \mathcal{N}(\mu_{\mathrm{shared}}, \sigma_{\mathrm{shared}}).$
(1)

In other words, the initial values of the slots are independent samples from a single Normal distribution with learnable parameters. We found in our experiments that this initialization method does not lead to good results in the text domain. In contrast with the experimental setting of Locatello et al. (2020), which used artificially generated data with a small number and range of objects, real language requires learning about a very large vocabulary of morphemes, and each example can have a large number of morphemes. This suggests that a model for language needs a more informative initialization with more trainable parameters, in order to have the capacity to learn about this large vocabulary and distinguish all the objects in a long sentence. To investigate this issue, we propose another initialization for adapting Slot Attention to text. We consider a separate learnable µ per slot and we fix σ to a predefined value for all the slots. Namely, the slots are initialized as

$\mathrm{slot}_i \sim \mathcal{N}(\mu_i, \sigma_{\mathrm{constant}}). \quad (2)$

By assigning a separate µ to each slot, the initialization has many more trainable parameters. This allows the model to learn about different kinds of units, such as ones that occur at different positions, or ones that have different types of forms, but we do not make any assumptions about what those differences might be. In addition, the intuition behind fixing σ is to prevent it from collapsing to zero, which in turn forces the slots to compress the information in a meaningful way. In particular, since the number of possible n-grams in text is finite but the slots can take any continuous value in $\mathbb{R}^{D_{\mathrm{slots}}}$, with a learnable σ the slots tend to learn an arbitrary mapping from n-grams in the input to the slots while driving σ to zero. In that case, there is no need for the slots to learn the underlying meaning-bearing units.
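As an illustration of Algorithm 1 combined with the per-slot initialization of Equation 2, the following is a minimal NumPy sketch, not the authors' implementation. It covers only the attention and weighted-mean steps of a single iteration; LayerNorm, the GRU update and the residual MLP are left out. The function names, the random projection matrices and all dimensions are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_slots(mu, sigma, rng):
    """Equation 2: one mean per slot (learned in practice), one fixed sigma."""
    return mu + sigma * rng.standard_normal(mu.shape)


def slot_attention_step(inputs, slots, w_q, w_k, w_v, delta=1e-8):
    """Attention and weighted-mean part of one Slot Attention iteration.

    LayerNorm, the GRU update and the residual MLP are omitted in this sketch.
    """
    dim = w_q.shape[1]
    q = slots @ w_q                      # (K, D) queries from slots
    k = inputs @ w_k                     # (N, D) keys from encoded characters
    v = inputs @ w_v                     # (N, D) values from encoded characters
    logits = (k @ q.T) / np.sqrt(dim)    # (N, K)
    attn = softmax(logits, axis=1)       # normalize over slots: slots compete
    weights = attn + delta
    weights = weights / weights.sum(axis=0, keepdims=True)  # normalize over inputs
    updates = weights.T @ v              # (K, D) weighted mean of the values
    return attn, updates


# Hypothetical sizes and random stand-ins for learned parameters.
rng = np.random.default_rng(0)
num_chars, d_input, num_slots, d_slots, d_common = 12, 8, 4, 6, 5
encoded = rng.standard_normal((num_chars, d_input))   # stand-in for X'
mu = rng.standard_normal((num_slots, d_slots))        # per-slot means
slots = init_slots(mu, sigma=0.5, rng=rng)
w_q = rng.standard_normal((d_slots, d_common))
w_k = rng.standard_normal((d_input, d_common))
w_v = rng.standard_normal((d_input, d_common))
attn, updates = slot_attention_step(encoded, slots, w_q, w_k, w_v)
print(attn.shape, updates.shape)
```

Each row of `attn` sums to one, which is the sense in which slots compete for every input character. Replacing `mu` with a single vector broadcast to all slots and making `sigma` learnable would correspond to the shared initialization of Equation 1.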
By imposing a constant noise on the slots through the constant σ, we limit the information which can be passed through each slot, from an information-theoretic point of view. Experiments showing the greater effectiveness of Equation 2 over Equation 1 are given in Appendix A.1. We then obtain the set of slots $M = \{m_1, \ldots, m_K\} = \mathrm{SlotAttention}(X')$.

Neural sparsification layer: $L_0$Drop. The number of units needed to represent a sequence varies among different sequences in the data. In contrast to the object discovery work, where the data is generated synthetically and the number of objects in the scene is therefore known beforehand, we do not make any assumptions about the number of units required. Therefore, we consider an upper bound on the number of required units and prune the extra ones per input sequence. We accomplish this goal by using a neural sparsification layer called $L_0$Drop (Zhang et al., 2021). It allows our model to dynamically decide on the number of required units for every input sequence. This layer consists of stochastic binary-like gates $g = g_1 \ldots g_K$ which, for every input $m_i$, work as

$L_0\mathrm{Drop}(m_i) = g_i \, m_i. \quad (3)$

When $g_i$ is zero the gate is closed, and when it is one the whole input is passed. Each gate is a continuous random variable in the [0, 1] interval, sampled from a hard-concrete distribution (Louizos et al., 2018). This distribution assigns most of its probability mass to its endpoints (i.e., 0 and 1) in favour of the sparsification goal. Specifically, $g_i \sim \mathrm{HardConcrete}(\alpha_i, \beta, \epsilon)$, where β and ϵ are hyperparameters. $\alpha_i$ is predicted as a function of the encoder output $m_i$, i.e., $\log \alpha_i = m_i w^T$, where w is a learnable vector. This allows the model to dynamically decide which inputs to pass and which ones to prune.
The $L_0$ penalty, which yields the expected number of open gates, is computed as:

$L_0(M) = \sum_{i=1}^{K} \left(1 - p(g_i = 0 \mid \alpha_i, \beta, \epsilon)\right), \quad (4)$

where the probability of $g_i$ being exactly 0 is provided in closed form in Louizos et al. (2018). We follow the same approach as Louizos et al. (2018) at evaluation time and consider the expectation of each gate as its value. We refer to the pruned slots after applying the $L_0$Drop layer as $M' = m'_1 \ldots m'_K$.

Decoder. Lastly, we regenerate the input sequence from the set of slots by using a simple, shallow decoder. To this end, we use a one-layer Transformer decoder (Vaswani et al., 2017) with a single attention head over the slots. A simple decoder forces the slots to learn representations with a straightforward relationship to the input, which we expect to be more meaningful. In other words, we do not use a powerful decoder because it would be able to decode even low-quality representations of the input, which are less meaningful (Bowman et al., 2015).

Training objective. We train our model end-to-end by using the Gumbel trick for sampling HardConcrete variables (Maddison et al., 2017; Jang et al., 2017). The training objective is

$L_{rec}(X, M') + \lambda L_0(M) = -\log\left(\mathbb{E}_g[p(X \mid M')]\right) + \lambda L_0(M) \quad (5)$
$\qquad \le \mathbb{E}_g[-\log p(X \mid M')] + \lambda L_0(M) = L(X),$

which consists of the reconstruction loss from the decoder ($L_{rec}$) and the $L_0$ penalty for the open gates; the upper bound follows from Jensen's inequality. The hyperparameter λ, the sparsification rate, controls the trade-off between the two losses. In practice, we find that in order to impose enough sparsity in the slots, we should slightly increase λ during the course of training using scheduling techniques.
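The gate sampling and the penalty described in this section can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' code: it uses the hard-concrete formulas of Louizos et al. (2018) and assumes that the stretch interval is (-ϵ, 1 + ϵ), so that ϵ plays the role of the stretch hyperparameter. The function names and the values of β and ϵ are assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, beta, eps, rng):
    """Draw one hard-concrete gate in [0, 1] via the reparameterization trick."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    stretched = s * (1.0 + 2.0 * eps) - eps   # assumed stretch to (-eps, 1 + eps)
    return min(1.0, max(0.0, stretched))      # clip back to [0, 1]


def prob_open(log_alpha, beta, eps):
    """1 - p(g = 0): probability that the gate is not exactly zero."""
    return sigmoid(log_alpha - beta * math.log(eps / (1.0 + eps)))


def l0_penalty(log_alphas, beta, eps):
    """Expected number of open gates, summed over all slots."""
    return sum(prob_open(a, beta, eps) for a in log_alphas)


def l0_drop(slot, gate):
    """Equation 3: scale a slot vector by its gate."""
    return [gate * value for value in slot]


# Hypothetical hyperparameters and gate logits for three slots.
beta, eps = 2.0 / 3.0, 0.1
log_alphas = [-4.0, 0.0, 4.0]
rng = random.Random(0)
gates = [sample_gate(a, beta, eps, rng) for a in log_alphas]
pruned = [l0_drop([1.0, -2.0], g) for g in gates]
print(gates, l0_penalty(log_alphas, beta, eps))
```

A strongly negative `log_alpha` makes the gate close almost surely and contributes close to zero to the penalty, while a strongly positive one keeps the slot and contributes close to one.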

2.4. STRIDE-BASED MODELS

We propose a simple hand-crafted alternative to our induced units which achieves acceptable performance. We design this model by replacing our Dynamic Capacity Slot Attention module with a linear-size bottleneck. In particular, we take 1 out of every k encoder outputs and down-project them ($\mathbb{R}^{D_{\mathrm{input}}} \to \mathbb{R}^{D_{\mathrm{slots}}}$) to obtain the representations. In other words, we only pass certain encoder outputs based on their position and drop the rest:

$M = \mathrm{DownProject}(x'_1 \, x'_{k+1} \, x'_{2k+1} \ldots x'_{nk+1}), \quad \text{where } n = \left\lfloor \frac{N-1}{k} \right\rfloor.$

We can get different alternative models by varying the stride k. The training objective is the reconstruction loss $L_{rec}(X, M) = -\log p(X \mid M)$. The idea of using stride-based models has also been used in Clark et al. (2022) and Tay et al. (2022) in a different context and setup.
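The position-based selection above can be illustrated with a short NumPy sketch. The projection matrix stands in for the learned DownProject layer, and the function name and the sizes are assumptions made for the example.

```python
import numpy as np


def stride_bottleneck(encoded, stride, w_down):
    """Keep every `stride`-th encoder output and project it down.

    Rows 0, k, 2k, ... of `encoded` correspond to x'_1, x'_{k+1}, x'_{2k+1}, ...
    """
    selected = encoded[::stride]
    return selected @ w_down


rng = np.random.default_rng(0)
encoded = rng.standard_normal((10, 8))   # N = 10 characters, D_input = 8
w_down = rng.standard_normal((8, 4))     # stand-in for the learned projection
units = stride_bottleneck(encoded, stride=3, w_down=w_down)
print(units.shape)
```

With N = 10 and k = 3, the selected positions are 1, 4, 7 and 10, i.e. n + 1 = ⌊(N - 1)/k⌋ + 1 = 4 units.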

3. RELATED WORK

Unsupervised Object Discovery. There is a recent line of research in the image domain for discovering objects in a scene without explicit supervision, and building an object-centric representation of them. Most of this work is built around the idea of compositionality of the scenes. MONet (Burgess et al., 2019) and GENESIS (Engelcke et al., 2020) similarly use a recurrent attention network for learning the location masks of the objects. Greff et al. (2016; 2017; 2019) ; Van Steenkiste et al. (2018) ; Emami et al. (2021) model the scene as a spatial Gaussian mixture model. Furthermore, AIR network (Eslami et al., 2016) and its variants (Crawford & Pineau, 2019; Lin et al., 2020) model objects from a geometric perspective by defining three specific latent variables. Lately, Locatello et al. (2020) propose an attention-based algorithm (namely Slot Attention) to learn object representations (slots). In contrast to this line of work in vision, our approach is specifically designed for text. We use additional components (e.g., L 0 Drop layer) in our architecture to resolve the requirements of modeling textual data. Furthermore, our model is trained and evaluated on real text datasets, in contrast to these previous models which have only been shown to be effective on synthetic scenes. Unsupervised morphology learning. This subject has been of interest for many years in the NLP field (Elman, 1990; Creutz & Lagus, 2002; Baroni et al., 2002) . Morphemes have strong linguistic motivations, and are practically important in many downstream tasks because they are the smallest meaning-bearing units in a language (Can & Manandhar, 2014) . Many approaches have been proposed for discovering the underlying morphemes or morpheme segmentations. Morfessor variants are based on probabilistic machine learning methods (MDL, ML, MAP) for morphological segmentation (Creutz, 2003; Creutz & Lagus, 2002; 2005; 2007; Virpioja et al., 2013) . 
Some researchers take a Bayesian approach for modeling word formation (Poon et al., 2009; Narasimhan et al., 2015; Bergmanis & Goldwater, 2017; Luo et al., 2017) . Adaptor Grammars are another approach for modeling morphological inflections (Sirts & Goldwater, 2013; Eskander et al., 2016; 2019; 2020) . In addition, Xu et al. (2018; 2020) built their models upon the notion of paradigms, set of morphological categories that can be applied to a set of words. Moreover, Soricut & Och (2015) ; Üstün & Can (2016) extract morphemes by considering the semantic relations between words in the continuous embedding space. Cao & Rei (2016) propose to learn word embeddings by applying a bi-directional RNN with attention over the character sequence. Furthermore, Ataman et al. ( 2020) model word formation as latent variables which mimic morphological inflections in the task of machine translation. Our work differs from the previous work in classical morphology learning in two ways. First, instead of explicitly discovering morphemes, we learn a set of continuous vector representations of the input, which would then need to be processed to extract the morphemes. The model itself has no explicit relation between these unit representations and segments of the input. Second, our model learns representations of an entire input sentence, rather than individual space-delimited words. This makes fewer assumptions about morphemes, and considers the context of the words in a sentence. Our work is similar to Ataman et al. ( 2020) in modeling morphology implicitly in the latent space. However, we employ a self-supervised objective for our purpose which is more general compared to their supervised loss, as we do not need labeled data. Unsupervised character segmentation. Learning to segment a character sequence in unsupervised fashion is another relevant area to our work. Chung et al. 
(2017) propose Hierarchical Multi-scale RNNs for modeling different levels of abstractions in the input sequence. In the language modeling task, they observe that the first layer is roughly segmenting the sequence into words, namely at space boundaries. Sun & Deng (2018) propose Segmental Language Models for Chinese word segmentation. Moreover, in (Kawakami et al., 2019) , the authors design a model to learn the latent word segments in a whitespace-removed character sequence with a language modeling objective. As we mentioned earlier, we learn continuous vector representations of text which is different from explicitly detecting discrete character segments. Subword discovery algorithms. This set of algorithms have become a standard component of NLP models in recent years. Byte-Pair-Encoding (BPE) (Sennrich et al., 2016) iteratively merges the two consecutive tokens with the highest frequency. Word-piece (Schuster & Nakajima, 2012) , sentence-piece (Kudo & Richardson, 2018) and unigram LM (Kudo, 2018) are other similar subword tokenization algorithms. In contrast to these methods, which mostly use local statistical information of the data, our model is trained over complete sentences to learn an abstract sophisticated representation.

4. EXPERIMENTS

We evaluate our unsupervised model both by visualizing attention maps (4.2), and by probing the slot vectors (4.4).

4.1. EXPERIMENTAL SETUP

We apply our model to languages from different morphological typologies. We select English (EN), German (DE), French (FR), Spanish (ES) and Czech (CS) from the fusional family and Finnish (FI) from the agglutinative typology. For English we use the raw Wikitext2 dataset (Merity et al., 2017) . For the rest we use Multilingual Wikipedia Corpus (MWC) (Kawakami et al., 2017) . As for the models, we use a standard Transformer architecture (Vaswani et al., 2017) with model dimension 256. The encoder consists of 2 layers with 4 self-attention heads and the decoder consists of 1 layer with 1 self-attention head and 1 attention head over the slots. We feed in the sentences with less than 128 characters to our model and consider the number of slots as 64 (half of the maximum input length). In addition, we take the dimension of slots as 128. We scheduled the λ parameter in the training loss to start with a low value and exponentially increase it every 10 epochs until it reaches a certain limit. We obtain this limit manually in a way that the final number of open gates roughly equals the average number of BPE tokens in a sequence. More details of the settings are available in the Appendix B.
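The schedule for λ described above (start low, increase exponentially every 10 epochs, stop at a limit) could look like the following sketch. The starting value, growth factor and limit are hypothetical placeholders, not the values used in our experiments.

```python
def sparsity_weight(epoch, start=1e-4, factor=2.0, limit=1e-1, every=10):
    """Step-wise exponential schedule for lambda, capped at `limit`.

    All default values are illustrative assumptions.
    """
    return min(limit, start * factor ** (epoch // every))


for epoch in (0, 9, 10, 50, 200):
    print(epoch, sparsity_weight(epoch))
```

The value stays constant within each block of 10 epochs and is multiplied by `factor` at the start of the next block until the cap is reached.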

4.2. VISUALIZATION

In order to show some qualitative results of our model, we visualize the attention maps for generating every output, shown in Figure 2 . In particular, we show the attention of the decoder over slots when generating every output character. Interestingly, although we do not impose any sparsity in the decoder's attention weights, the attention maps are quite sparse. Namely, at each generation step only a few slots are attended, and each slot is attended while generating only a few characters. In addition, although we do not impose any bias towards discovering segments of the input, the characters which are generated while attending to a given slot are contiguous in the string (the vertical bands in Figure 2 ). We believe that the emergence of contiguous spans is a result of the bottleneck we create with our Dynamic Capacity Slot Attention. This means that the model is trying to put correlated information about the input in the same vector, so that it can represent the string more efficiently. The strongest correlations in the character string are local in the input, so each slot tends to represent characters which are next to each other in the input. In early steps of training, when the sparsity ratio (λ) is small, each slot tends to represent a bigram of characters (2a) and later on, trigrams (2b). These observations confirm the necessity of the L 0 Drop layer for converging to better units. In particular, as the ratio increases, the number of active slots reduces and they become more specialized in representing contiguous meaning-bearing units of input. For instance, the word cooking in 2c is represented by two slots cook and ing. That these segments roughly correspond to the morphemes which we want the model to discover, is verified quantitatively in the probing experiments in Section 4.4.

4.3. PROBING METHOD

Since our model is unsupervised and does not use artificially generated data, there is no obvious gold standard for what units it should have learned. To quantitatively evaluate how well it performs, we freeze the trained model and train probing classifiers to measure to what extent the discovered units capture the same information as previously proposed meaningful units. We use multiple previously proposed representations which have been shown to be good levels of abstraction for a range of NLP applications. If the induced units provide a similar level of abstraction, separated into units in a similar way, then we can expect that they will also be effective in NLP applications. We assume that two representations provide the same level of abstraction if there is a one-to-one mapping between their respective units such that two aligned units contain the same information, meaning that each unit can be predicted from the other. Predicting the previously proposed unit from our induced unit is a probing task, as used in previous work (Belinkov & Glass, 2019) , which we specify as forward probing. We refer to the prediction of our induced unit from the previously proposed unit as a reverse probing task, which we believe is a novel form of evaluation. We compare various models on their trade-off between forward and reverse probing. Probing data. We consider three target representations which either have linguistic or practical evidence of being effective in NLP applications. For the languages for which an in-depth linguistic morphological analysis is available, we compare to these morphemes (i.e., EN and FR) ( Žabokrtský et al., 2022; Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020) . Alternatively, as linguistically inspired units, we have the Morfessor (Virpioja et al., 2013) outputs, and BPEs (Sennrich et al., 2016) as frequency-based subwords. Forward probing. 
This is the common way of probing where we want to measure if our induced units include the information about the target representation. We train a 2 layer MLP as our probing classifier f for this purpose. We apply the classifier with shared parameters to each of our slots individually and obtain a set of predictions {f (m ′ 1 ), f (m ′ 2 ), . . . , f (m ′ K )}, one per slot. As we are dealing with a set, during training we need to find a one-to-one matching between the classifier's predictions and the target tokens. Therefore, we follow Locatello et al. (2020) to use the Hungarian matching algorithm (Kuhn, 1955) for finding the match which minimizes the classification loss. We consider the complete set of slots after applying the L 0 Drop layer as the inputs to our classifier. Slots whose L 0 Drop gate is closed are simply input as zero vectors. This gives us a fixed number of vectors. The two sides of matching should have the same size to obtain a one-to-one match, therefore, we add an extra target label (i.e., empty) for representing the pruned slots. Due to the fact that many slots are pruned out, considering a measure like accuracy could be misleading, since a classifier which outputs empty label will achieve very high accuracy. Therefore, we build a confusion matrix as follows. We consider all non-empty labels as positive and the empty ones as negative, and we report precision (P), recall (R) and F1 measure, to better reflect what the slots have learned. Reverse probing. To evaluate whether the induced units capture the same level of abstraction, we also need to evaluate whether the induced units abstract away from the same information as the previously proposed units. We propose to measure this by training reverse probing classifiers, which predict each induced unit from its aligned target unit. 
Because the induced unit is a continuous vector, we predict the parameters of a d-dimensional Gaussian distribution with diagonal covariance, and measure the probability of the induced vector in this distribution. We first pass the tokens into an Embedding layer and then apply a 2 layer MLP on top of each embedding individually to predict the parameters of the distribution. We take the mappings between the tokens and slots from the trained (forward) probing classifier described previously. As the objective, we maximize the log-probability of representations under the predicted distributions. We consider the loss on the test set as our evaluation measure.
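The reverse probing score is the log-probability of an induced vector under a predicted diagonal Gaussian. A minimal sketch of that computation is given below. Parameterizing the distribution by a mean and a log-variance per dimension is an assumption made for the example, as is the function name.

```python
import math


def diag_gaussian_log_prob(x, mean, log_var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    total = 0.0
    for x_j, mean_j, log_var_j in zip(x, mean, log_var):
        squared_error = (x_j - mean_j) ** 2 / math.exp(log_var_j)
        total += -0.5 * (math.log(2.0 * math.pi) + log_var_j + squared_error)
    return total


slot = [0.2, -1.0, 0.5]          # an induced unit (hypothetical values)
predicted_mean = [0.0, -0.8, 0.4]
predicted_log_var = [0.0, -1.0, 0.5]
loss = -diag_gaussian_log_prob(slot, predicted_mean, predicted_log_var)
print(loss)
```

The reverse probe is trained to minimize this negative log-probability, so a lower value on the test set means that the induced unit is more predictable from its aligned target unit.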

4.4. PROBING RESULTS

Given a target set of units and a trained model for finding units, we align the two sets of units and use forward probing to see if the induced units contain as much information as the target units and reverse probing to see if the induced units contain more information than the target units. Optimizing both measures requires having the same information as each target unit in the associated induced vector, and hence having the same level of abstraction. The primary mechanism we have to control the amount of information in an induced representation is to control the number of units. More units means that there are more opportunities for the alignment to find a unit which correctly predicts each target unit, but makes it harder to predict all the units from the available target units. We take two approaches to controlling the number of induced units. First, we fix the number of units to a target number, which allows an efficient comparison of models with only forward probing. Second, we vary the number of units, which allows us to evaluate how well different models can match the informativeness trade-off of the target representation.

Figure 3: 2D graphs showing the informativeness trade-off for slot-attention models (blue points) and the stride-based models (red points). The x-axis shows the F1 measure for the forward probe and the y-axis shows the negative log-probability loss for the reverse probe, so better models are lower and further to the right.

Targeted informativeness. As a less computationally expensive initial evaluation, we use hyperparameters to set the number of induced units to approximately the same as the number of target BPE tokens. For the stride-based models, we simply set the stride according to the average target number of tokens (i.e., stride = 6). For the slot-attention models, we empirically set hyperparameters so that the number of open gates in the $L_0$Drop layer is approximately the same as the target number.
Given that all models output approximately the same number of units, we can simply compare them with standard (forward) probing. We use probing to compare slot attention and stride-based models to an uninformative baseline representation (untrained), thereby controlling for the power of the probing classifier to learn arbitrary mappings (Conneau et al., 2018; Oord et al., 2018). This baseline is the set of slot vectors output by a randomly-initialized slot attention model without any training. Table 1 shows the results of the forward probing tasks on different languages. As the results show, the trained slots achieve much higher performance in both tasks in comparison to the random baselines. Our model achieves very high precision in predicting the non-empty labels. Its performance is weaker on the recall side, but here the improvement over the untrained model is even more pronounced. This seems to be due to the imbalance between empty and non-empty labels in the training set, where the empty labels comprise around 66% of the data for the probing classifier. For this reason, below we will use F1 as our overall performance measure for forward probing. The comparable stride-based model performs similarly to our slot attention model, but generally not as well. The performance gain of our slot-attention models over the stride-based models is more noticeable on Morfessor targets (+0.07 average F1), and for the two languages where we have real morpheme annotations, which implies that our slot-attention based method is more effective in discovering linguistically meaningful units. There is no clear difference in our model's performance on different targets, and it performs equally well on both the agglutinative language (Finnish) and the fusional languages. We further show some examples from the predictions of the classifiers in Appendix A.4.

Informativeness trade-off. To vary the number of induced units, we use hyperparameters.
For the stride-based models, this simply means varying the length of the stride. For the slot-attention models, to control the number of units more easily, we slightly change the training objective so that it explicitly controls the number of open slots. We replace the $L_0$ term in Eq 5 with $\max(L_0, r)$, where $r$ indicates the desired number of open slots. In this setup, the model will close $L_0$ gates until it reaches the limit $r$. With the results of probing in the two directions, we can plot two-dimensional graphs to compare different models. We train stride-based models with strides in {2, 3, 4, 5, 6} and slot-attention based models with $r = \text{input length} / \text{stride}$, which leads both models to have the same number of valid units. Figure 3 plots this informativeness trade-off for English, French and Finnish with different target representations. Overall, the slot-attention based models provide better representations than the stride-based models, since they are nearer to the lower right corner of the plot. This is clearer for English and French, with the results for Finnish being more mixed. We believe the stride-based models contain extra information which degrades the performance of the reverse probe and causes them to have a higher evaluation loss. These results justify our design choices for the slot-attention based models to achieve the goal of discovering meaningful units which have the same level of abstraction as the previously proposed units. In addition, although our stride-based models fall behind the slot-attention based models, they can serve as strong baselines for evaluating such unsupervised models, and can find acceptable abstract units which remove the need for tokenizing text as a preprocessing step.
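The clamped objective described above can be illustrated with a minimal sketch. It assumes that the $L_0$ term is the expected number of open gates and that $r$ is obtained by integer division of the input length by the stride; both the function names and the rounding choice are hypothetical and are not taken from the paper's code.

```python
def target_open_slots(input_length: int, stride: int) -> int:
    """r = input length / stride (integer division is an assumption)."""
    return input_length // stride


def clamped_l0_penalty(expected_open_gates: float, r: int) -> float:
    """max(L0, r): the penalty is flat once at most r gates are open,
    so the regularizer no longer pushes the model to close more gates."""
    return max(expected_open_gates, float(r))


if __name__ == "__main__":
    for stride in (2, 3, 4, 5, 6):
        r = target_open_slots(128, stride)
        # With 40 open gates the penalty is still active whenever r < 40;
        # with 10 open gates it is clamped to r for every stride here.
        print(stride, r, clamped_l0_penalty(40.0, r), clamped_l0_penalty(10.0, r))
```

Because the clamped term is constant below $r$, its gradient with respect to the gates vanishes there, which is what lets the number of open slots settle near the chosen limit.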

5. CONCLUSIONS

In this paper, we propose Dynamic Capacity Slot Attention for discovering meaningful units in text in an unsupervised fashion. We use an auto-encoder architecture for encoding a character sequence into a set of continuous slot vectors and decoding them to reconstruct the original sequence. We force the set of slots to act as a bottleneck in transferring information between the encoder and the decoder by adding constant noise to their vectors and integrating an $L_0$ regularizing layer on top of the slots which only retains the necessary vectors. In addition, we propose a set of stride-based models which could serve as an alternative to our main model. We evaluate our model by probing the equivalence between the pruned slots and predefined tokens. In particular, we propose to do reverse probing in addition to standard forward probing. Our experiments show that our representations effectively capture meaningful information at a higher level of abstraction.

Limitations. We learn a set of abstract continuous vector representations of the input and we do not explicitly discover morphemes or morpheme segments. Although the visualized attention maps show interesting patterns of input segmentation (see Figure 2), the vertical bands fade near their end-points. More specifically, it is not straightforward to determine boundaries between the bands, as the transition is smooth due to the continuous nature of the attention function. For this reason, we could not obtain good segmentations by employing simple heuristics on the attention maps. Moreover, the purpose of our work is the novel adaptation of an object discovery method to text and the development of an intrinsic evaluation framework for this line of work. Therefore, evaluating the obtained representations on downstream tasks is not in the scope of our paper.
However, since previous work has shown that the target representations used for our probing analysis are effective ways to capture meaning for downstream tasks, and since these results show that the information captured by our induced units is similar to the information in these representations, there is every reason to believe that the units discovered by the proposed model will also be effective meaning-bearing units for downstream tasks. We leave the empirical verification of this effectiveness to future work. As an initial experiment, we take our pretrained slot vectors and use them as text representations on the challenge set released by Hofmann et al. (2022). The details of this experiment are provided in Appendix B.5. For ArXiv-L, on average we get 0.396, 0.409, 0.2732 F1 for the slot-attn, stride=6, and character-based models, respectively, on the Dev set, and 0.393, 0.394, 0.271 on the Test set. These numbers indicate that our proposed models work much better than a character-based model on this challenging task and therefore could be a useful replacement for tokenisation methods in some downstream tasks.

Reproducibility Statement

We explain the details of our experiments in Appendix B. It includes the data we have used and how we have processed it, in addition to the models' parameters and training details. We will release our code upon acceptance. As for the datasets, we lowercased the text and retained the characters which occur more than 25 times in the corpus, following Kawakami et al. (2017). We replace the low-frequency characters with an unknown placeholder. Table 17 shows the licenses of each dataset that we used in our experiments. We used the same train/validation/test splits as provided in the mentioned datasets.

B.2 MAIN MODEL SETTINGS

Table 18 shows the remaining list of hyperparameters in training the main model. We scheduled the λ parameter in the training loss to start with a low value of $2 \times 10^{-5}$ and exponentially increase it every 10 epochs until it reaches a certain limit. In particular, we schedule λ to increase exponentially with ratio 2 for English until reaching $6.4 \times 10^{-4}$, and with ratio 1.5 for the rest of the languages. More specifically, we stop the exponential increase after reaching $3 \times 10^{-4}$ for German and Spanish and $5 \times 10^{-4}$ for Finnish and Czech. We tried stopping thresholds ranging from $3 \times 10^{-4}$ to $6 \times 10^{-4}$. These scheduling values lead to having roughly as many slots as the average number of BPE units per sentence. We also tried our model with a static λ in $\{10^{-4}, [1, 3, 5, 6, 7, 8] \times 10^{-5}\}$, which did not lead to stable results at training time. We tried 16, 32 and 64 slots in our experiments. We chose the number of slots to be half of the maximum sequence length (128), as this is a reasonable upper bound which also matches the maximum number of BPE or Morfessor units. We tried the transformer encoder with 4 and 6 layers but qualitatively did not find any improvements. We run the Slot Attention algorithm for $T = 1$ iteration, for simplicity and efficiency, and because preliminary experiments showed no improvements with more iterations. We leave the investigation of how to get 
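The λ schedule described in this section can be sketched as follows. This is a minimal illustration, assuming that λ is multiplied by the ratio once every 10 epochs and is then held at the limit (capping with `min` is an assumption; the text only says the increase stops after the limit is reached). The function name and defaults are hypothetical, with the defaults set to the English values quoted above.

```python
def lambda_at_epoch(epoch: int,
                    start: float = 2e-5,
                    ratio: float = 2.0,
                    limit: float = 6.4e-4,
                    every: int = 10) -> float:
    """Exponentially increasing regularization weight, capped at `limit`."""
    steps = epoch // every                    # number of increases applied so far
    return min(start * ratio ** steps, limit)


if __name__ == "__main__":
    # English setting: ratio 2, limit 6.4e-4.
    print([lambda_at_epoch(e) for e in range(0, 80, 10)])
    # German/Spanish setting: ratio 1.5, limit 3e-4.
    print([lambda_at_epoch(e, ratio=1.5, limit=3e-4) for e in range(0, 100, 10)])
```

With the English values, the limit is reached after five increases, since $2 \times 10^{-5} \cdot 2^{5} = 6.4 \times 10^{-4}$.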

B.3 FORWARD PROBE SETTINGS

Our probing classifier consists of two fully connected layers with a ReLU activation function in between. The hidden dimension of the classifier is the same as the slots' dimension, which is 128. We use the same datasets as our main model for training and testing the probes. We train BPE with a vocabulary size of 5000 for all languages. For Morfessor, we use the pretrained model and consider the set of its outputs on the training data as our target representation. As for the morpheme targets, we take the morphemes for the words which are available in the linguistically annotated data (i.e., MorphoLex (Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020)), and for the rest of the words we take Morfessor outputs as an approximation of morphemes. For training the probing classifiers we use a batch size of 4, since the Hungarian matching algorithm requires a large amount of memory. We train our classifiers for 200 epochs with the Adam optimizer and a learning rate of $1 \times 10^{-3}$.

Table 4
input: una razón de su auge fue su aparente éxito en tratar enfermos por epidemias infecciosas .
output [BPE]: una razón de su auge fue su aparente éxito en tratar enfermos por epi#as# inf#ocosasón
output [Morfessor]: una razón de su auge fue su aparente éxito en tratar enfermcio por epidemias#s .rs
input: außerdem wurde er zum besten spieler des turniers gewählt .
output [BPE]: außerdem wurde er zum sten spieler des #ur#ierers gewähl# .
output [Morfessor]: außerdem wurde er zutur besten spieler des turniers gewählt #

We illustrate how forward and reverse probing are related to each other in Figure 7. We first assign a one-hot vector to the tokens in the target set of units (i.e., a vector in $\mathbb{R}^{S}$, where $S$ is the target set size). Then, we learn an embedding layer to map every one-hot vector into a continuous vector of dimension 128. Afterwards, we pass the embedded token through two fully connected layers, with a ReLU activation in between and a hidden dimension of 128.
We then predict the mean and the standard deviation of a Gaussian distribution. The dimension $d$ of the Gaussian distribution is $D_{slots} = 128$. We use the same matching between the target units and slots as in the forward probing; that is, we do not run the matching algorithm again for this experiment. As the training objective, we minimize the negative log-likelihood of the slot vector given this distribution, i.e., $-\log p(m_i \mid \mu_{predicted}, \sigma_{predicted})$, which is equivalent, up to an additive constant, to minimizing

$$\frac{1}{2\sigma_{predicted}^{2}} \left(m_i - \mu_{predicted}\right)^{2} + \log(\sigma_{predicted}).$$

We train this model with the Adam optimizer and a learning rate of $1 \times 10^{-4}$ for 200 epochs. We report the best evaluation loss on the test set.
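As a numerical check of the reverse-probe objective, the following sketch computes the negative log-likelihood of a vector under a Gaussian and compares it with the simplified loss. It assumes a diagonal Gaussian, that is, one predicted mean and one predicted standard deviation per dimension; the function names and the example values are hypothetical.

```python
import math


def gaussian_nll(m, mu, sigma):
    """-log p(m | mu, sigma) for a diagonal Gaussian, summed over dimensions."""
    return sum(
        0.5 * math.log(2 * math.pi) + math.log(s) + (x - u) ** 2 / (2 * s ** 2)
        for x, u, s in zip(m, mu, sigma)
    )


def simplified_loss(m, mu, sigma):
    """The same objective without the constant 0.5 * log(2 * pi) per dimension."""
    return sum(
        (x - u) ** 2 / (2 * s ** 2) + math.log(s)
        for x, u, s in zip(m, mu, sigma)
    )


if __name__ == "__main__":
    m, mu, sigma = [0.5, -1.0], [0.0, 0.0], [1.0, 2.0]
    constant = len(m) * 0.5 * math.log(2 * math.pi)
    # The two losses differ only by a term that does not depend on mu or sigma.
    print(gaussian_nll(m, mu, sigma) - simplified_loss(m, mu, sigma), constant)
```

Since the difference between the two functions does not depend on the predicted parameters, minimizing either one yields the same probe.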

B.5 ARXIV CHALLENGE SETTINGS

We used the challenge set released by Hofmann et al. (2022), which is intended to serve as a benchmark for PLM tokenizers. The task is to classify each ArXiv title into its corresponding sub-area (within a category of 20) in the three subjects of CS, Maths, and Physics. The dataset requires a challenging generalization from a small number of short training examples with highly complex language (Hofmann et al., 2022). For this reason, we believe it is a good benchmark for evaluating our models. Since our models output a set of vectors as the high-level representation of the character sequence, we design an attention-based classifier to perform the task. In particular, we learn a vector which acts as the query, where the sequence representations are the keys and values. After applying attention over the sequence representations, we apply a two-layer MLP with a ReLU non-linearity to compute the scores for each sub-area. We freeze the representations we get from our trained models and only train the 
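The attention-based classifier described in this section can be sketched in plain Python as follows. This is only an illustration of the forward pass under stated assumptions: the scores are scaled dot products (the scaling by the square root of the dimension is an assumption), all weights are random rather than trained, and the dimensions other than the 20 sub-areas are hypothetical.

```python
import math
import random


def softmax(scores):
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, units):
    """A single query attends over the frozen unit vectors (keys = values)."""
    d = len(query)
    scores = [sum(q * u for q, u in zip(query, unit)) / math.sqrt(d)
              for unit in units]
    weights = softmax(scores)
    pooled = [sum(w * unit[j] for w, unit in zip(weights, units))
              for j in range(d)]
    return pooled, weights


def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_scores(x, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU non-linearity, one score per sub-area."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim, num_classes, num_units = 8, 8, 20, 5   # toy sizes
units = random_matrix(num_units, dim, rng)               # stand-in for frozen slots
query = [rng.uniform(-0.1, 0.1) for _ in range(dim)]     # the learned query vector
w1, b1 = random_matrix(hidden_dim, dim, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(num_classes, hidden_dim, rng), [0.0] * num_classes

pooled, weights = attention_pool(query, units)
scores = mlp_scores(pooled, w1, b1, w2, b2)
print(len(pooled), len(scores), round(sum(weights), 6))
```

Pooling with a single query keeps the classifier independent of the number of units, so the same head can be applied to slot vectors, stride-based vectors, or character representations.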



Figure 2: Attention of the decoder over slots (x-axis) for generating every target character (y-axis) after different training epochs, for target sequence "the red colour associated with lobsters only appears after cooking.".



(a) Attention of decoder (y-axis) over slots (x-axis). (b) Attention of slots (x-axis) over the input (y-axis) before the L0Drop.

Figure 4: Illustration of the Attention of Slots over the input vs the Attention of decoder over the slots for Finnish language.

(a) Attention of decoder (y-axis) over slots (x-axis). (b) Attention of slots (x-axis) over the input (y-axis) before the L0Drop.

Figure 5: Illustration of the Attention of Slots over the input vs the Attention of decoder over the slots for French language.

Figure 6: Attention of decoder over the stride-based model vectors (x-axis) while generating every character (y-axis). The target sentence is "the red colour associated with lobsters only appears after cooking.".

Figure 7: An illustration of forward and reverse probing.

Table 1: Forward probing results on different languages. Note that human-annotated morphemes are only available for En and Fr.

Table 3: Reconstruction error on training and test set among different languages.

Table 4: Input and output pairs from Spanish and German datasets. The predictions are sorted based on their matching target. The empty label is shown as '#' and wrong predictions are shown in red.

Stride-based models for English with BPE targets

Slot Attention based models for English with BPE targets

Stride-based models for English with Morpheme targets

Slot attention based models for English with Morpheme targets

Stride-based models for English with Morfessor targets

Slot attention-based models for English with Morfessor targets

Stride-based models for French with BPE targets

Slot attention-based models for French with BPE targets

Stride-based models for French with Morpheme targets

Slot attention-based models for French with Morpheme targets

Stride-based models for French with Morfessor targets

Slot attention-based models for French with Morfessor targets

Table 19: List of packages and their versions.

A.1 SLOT INITIALIZATION ANALYSIS

We compare our proposed initialization with the original definition of Slot Attention, where slots are randomly initialized from a single shared distribution. Table 2 shows the probing results of our model under different slot initializations. Having a separate µ for each slot (Eq 2) increases the capacity of the model and thus yields better results in comparison to sharing µ between the slots. The model especially achieves higher recall in this case, which implies better recovery of the BPE and Morfessor tokens.

Table 2: Forward probing results (P = precision, R = recall) under different slot initializations.

Initialization  Model      BPE P  BPE R  BPE F1  Morfessor P  Morfessor R  Morfessor F1
Eq 2 (ours)     untrained  0.71   0.09   0.17    0.71         0.11         0.19
Eq 2 (ours)     slot-attn  0.96   0.73   0.82    0.95         0.74         0.83
Eq 1            untrained  0.80   0.10   0.18    0.78         0.12         0.20
Eq 1            slot-attn  0.97   0.51   0.66    0.96         0.51         0.66
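The difference between the two initializations can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes Gaussian sampling with a diagonal standard deviation, and it assumes that σ remains shared between slots in the Eq 2 variant, which the text above does not specify. All function and variable names are hypothetical.

```python
import random


def init_slots_shared(num_slots, mu, sigma, rng):
    """Eq 1 style: every slot is drawn from one shared N(mu, sigma^2)."""
    return [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
            for _ in range(num_slots)]


def init_slots_per_slot_mean(mus, sigma, rng):
    """Eq 2 style: slot k is drawn from N(mus[k], sigma^2), with its own mean."""
    return [[rng.gauss(m, s) for m, s in zip(mu_k, sigma)]
            for mu_k in mus]


if __name__ == "__main__":
    rng = random.Random(0)
    no_noise = [0.0, 0.0]
    # With zero noise, shared initialization makes all slots identical,
    # while per-slot means keep the slots distinct.
    print(init_slots_shared(3, [1.0, 2.0], no_noise, rng))
    print(init_slots_per_slot_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], no_noise, rng))
```

The zero-noise case makes the capacity argument concrete: under a shared mean the slots differ only through sampling noise, whereas separate means give each slot its own learned starting point.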

A.2 RECONSTRUCTION RESULTS

In Table 3, we report the reconstruction error among different languages. We compare the slot attention based model with the modified objective, i.e., $\max(L_0, r = 6)$, with the stride-based models (stride=6) to ensure that both models have the same average number of units. In all languages, slot attention based models are better able to reconstruct the sequence. This indicates that the slot-attention vectors are better at capturing information about the structure and meaning of the input than the stride-based ones, and therefore provide better signals to the decoder for reconstructing the sequence.

A.3.1 ATTENTION OF SLOTS OVER THE INPUT

In Figures 4 and 5, we illustrate the attention of slots over the input as well as the attention of the decoder over the slots for the same input. The two attentions show similar patterns, but on the decoder side the attention weights are higher and therefore the patterns are more visible. For this reason, we used the attention of the decoder over the slots in Section 4.2. Interestingly, in Figure 5b, there are some slots which attend only to the space boundaries (the horizontal bands).

A.3.2 DECODER ATTENTION FOR THE STRIDE-BASED MODELS

Figure 6 visualizes the attention of the decoder over the vectors of different stride-based models. As expected, the vertical bands are often of the same length and overlap too much for larger strides. As a result, the bands do not correspond to meaningful units in the input, in contrast to the slot-attention based model (see Figure 2c).

A.4 FORWARD PROBING EXAMPLES

Table 4 shows two examples of the probing classifiers' predictions given the learned slots. As explained in Section 4.4, the model is quite precise in predicting non-empty labels.

A.5 PROBING RESULTS' TABLES

Tables 5 to 15 show the detailed results of the forward and reverse probing tasks visualized in Figure 3 (reverse reconstruction vs. F1). $n_{input}$ denotes the input length, which is 128 in our experiments.

Data sources: (Kawakami et al., 2017), https://aclanthology.org/P17-1137/; MorphoLex (Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020).

C INFRASTRUCTURE

We implement our code with the PyTorch framework (version 1.2.0) and Python 3.6.9. Table 19 shows the rest of the libraries we use. We run our code on a single GTX 1080 Ti GPU under Debian 10 (Buster), 64-bit. We use the same compute for all of our experiments, including training the models and probes. The training time is five hours for the main model and around 2 days for the probes. We run our algorithm once for each reported result, since repeated runs would be too computationally expensive.

