INDUCING MEANINGFUL UNITS FROM CHARACTER SEQUENCES WITH DYNAMIC CAPACITY SLOT ATTENTION

Abstract

Characters do not convey meaning, but sequences of characters do. We propose an unsupervised distributional method to learn the abstract meaning-bearing units in a sequence of characters. Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the objects in the sequence, extending an architecture for object discovery in images. We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.

1. INTRODUCTION

When we look at a complex scene, we perceive its constituent objects and their properties, such as shape and material. Similarly, what we perceive when we read a piece of text builds on the word-like units it is composed of, namely morphemes, the smallest meaningful units in a language. This paper investigates deep learning models which discover such meaningful units from the distribution of character sequences in natural text.

In recent years, there has been growing interest in unsupervised object discovery in vision (Eslami et al., 2016; Greff et al., 2019; Engelcke et al., 2020). The goal is to segment a scene into its objects without supervision and, ideally, to obtain an object-centric representation of the scene. Such representations should generalize better to unknown scenes and facilitate abstract reasoning over the image. Locatello et al. (2020) proposed a relatively simple and generic algorithm for discovering objects, called Slot Attention, which iteratively finds a set of feature vectors (i.e., slots) that can bind to any object in the image through a form of attention.

Inspired by this line of work in vision, our goal is to learn a set of abstract continuous representations of the objects in text. We adapt the Slot Attention module (Locatello et al., 2020) for this purpose, extending it to discover the meaningful units in natural language character sequences. This makes our work closely related to unsupervised morphology learning (Creutz, 2003; Narasimhan et al., 2015; Eskander et al., 2020). However, there are fundamental differences between our work and morphology learning. First, we learn a set of vector representations of the text which are not explicitly tied to text segments. Second, our model learns its representations by considering the entire input sentence, rather than individual space-delimited words.
These properties of our induced representations make our method more appropriate for inducing meaningful units as part of deep learning models. In particular, we integrate our unit discovery method on top of the encoder in a Transformer auto-encoder (Vaswani et al., 2017), as depicted in Figure 1, and train it with an unsupervised sentence reconstruction objective. This setting differs from previous work on Slot Attention, which has been tested on synthetic image data with a limited number of objects (Locatello et al., 2020). We propose several extensions to Slot Attention for the domain of real text data: we increase the capacity of the model so it can learn to distinguish a large number of textual units, and we add the ability to learn how many units are needed to encode sequences of varying length and complexity. We therefore refer to our method as Dynamic Capacity Slot Attention. Additionally, as a hand-coded alternative, we propose stride-based models and compare them empirically to our slot-attention-based models.

To evaluate the induced representations, we both qualitatively inspect the model itself and quantitatively compare it to previously proposed representations. Visualization of its attention patterns shows that the model has learned representations similar to the contiguous segmentations of traditional tokenization approaches. Trained probing classifiers show that the induced units capture abstractions similar to the units previously proposed in morphological annotations (MorphoLex (Sánchez-Gutiérrez et al., 2018; Mailhot et al., 2020)) and tokenization methods (Morfessor (Virpioja et al., 2013), BPE (Sennrich et al., 2016)). We perform this probing evaluation in both directions, to compare both informativeness and abstractness. These evaluations show promising results in the ability of our models to discover units which capture meaningful information at a higher level of abstraction than characters.
To summarize, our contributions are as follows: (i) We propose a novel model for learning meaning-bearing units from a sequence of characters (Section 2). (ii) We propose simple stride-based models which can serve as strong baselines for evaluating such unsupervised models (Section 2.4). (iii) We analyze the induced units by visualizing the attention maps of the decoder over the slots, and observe the desired sparse and contiguous patterns (Section 4.2). (iv) We show that the induced units capture meaningful information at an appropriate level of abstraction by probing their equivalence to previously proposed meaningful units (Section 4.4).
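As a concrete illustration of the stride-based baseline idea, chunking the character sequence at a fixed width is the simplest possible version. The sketch below uses our own (hypothetical) function name, and omits the pooling of each chunk into a vector that an actual baseline model would perform:

```python
def stride_units(chars, stride=4):
    """Toy sketch of a stride-based baseline: split the character
    sequence into fixed-width chunks, each chunk standing in for
    one candidate 'unit' (a real model would pool each chunk into
    a vector; here we just return the raw chunks)."""
    return [chars[i:i + stride] for i in range(0, len(chars), stride)]
```

Note that such chunks ignore morpheme boundaries entirely, which is precisely why stride-based models are an informative lower bound for unsupervised unit discovery.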

2. APPROACH

2.1. PROBLEM FORMULATION

Given a sequence of N characters X = x_1 x_2 ... x_N, we want to find a set of meaning-bearing units (slots) M = {m_1, ..., m_K} which best represents X at a higher level of abstraction. As an example, consider the sequence "she played basketball", where we expect our slots to represent something like the set of morphemes of the sequence, namely {she, play, -ed, basket, -ball}.

2.2. OVERVIEW

We learn our representations by encoding the input sequence into slots and then reconstructing the original sequence from them. In particular, we use an auto-encoder structure where the slots act as the bottleneck between the encoder and the decoder. Figure 1 shows an overview of our proposed model, Dynamic Capacity Slot Attention. First, we encode the input character sequence with a Transformer encoder (Vaswani et al., 2017), which gives us one vector per character. Then, we apply our higher-capacity version of the Slot Attention module (Locatello et al., 2020) over the encoded sequence to learn the slots. Intuitively, Slot Attention learns a soft clustering over the input where each cluster (i.e., slot) corresponds to a candidate meaningful unit in the sequence. To select which candidates are needed to represent the input, we integrate an L0 regularizing layer, the L0Drop layer (Zhang et al., 2021), on top of the slots. Although the maximum number of slots is fixed during training, this layer ensures that the model only uses as many slots as necessary for the particular input. This stops the model from converging to trivial solutions for short inputs, such as passing every character through a separate slot. Finally, the Transformer decoder reconstructs the input sequence autoregressively using attention over the set of slots.
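The soft-clustering intuition behind Slot Attention can be sketched in a few lines once the learned projections, layer norms, GRU and MLP of the full module are stripped away. The code below is our own simplified illustration (not the paper's implementation): the key point is that attention is normalized over the slots axis, so slots compete to explain each input position, and each slot is then updated to the weighted mean of the inputs it claims.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_attention(inputs, slots, iters=3, eps=1e-8):
    """Bare-bones Slot Attention update loop (illustrative only)."""
    for _ in range(iters):
        # attn[n][k]: how strongly input position n is claimed by slot k;
        # the softmax runs over slots, so slots compete for each input.
        logits = [[dot(x, s) for s in slots] for x in inputs]
        attn = [softmax(row) for row in logits]
        new_slots = []
        for k in range(len(slots)):
            w = [attn[n][k] for n in range(len(inputs))]
            total = sum(w) + eps
            # weighted mean of the inputs assigned to slot k
            new_slots.append([
                sum(w[n] * inputs[n][d] for n in range(len(inputs))) / total
                for d in range(len(inputs[0]))
            ])
        slots = new_slots
    return slots
```

Run on two well-separated clusters of input vectors with two slots, the iterations pull each slot toward one cluster's mean, which is the clustering behaviour the model exploits to bind slots to textual units.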

2.3. MODEL

Encoder. We use the Transformer encoder architecture (Vaswani et al., 2017) to encode our sequence, obtaining the representation X' = x'_1 x'_2 ... x'_N from our input sequence X.

Slot Attention for text. After encoding the character sequence, we use our extended version of Slot Attention to discover meaningful units of the input character sequence. Slot Attention is a



Figure 1: The sketch of our model. First, the Transformer encoder encodes the sequence; then, Slot Attention computes the slot vectors (highlighted text). Next, the L0Drop layer dynamically prunes the unnecessary slots. Finally, the decoder reconstructs the original sequence.
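The dynamic pruning step in Figure 1 rests on a differentiable relaxation of the L0 norm. The sketch below shows the standard hard-concrete gate commonly used for such L0 regularization; the function names and the per-gate logit `log_alpha` are our own illustrative choices, not the exact parameterization of the L0Drop layer, which learns these logits from the slot vectors and adds the expected-L0 penalty to the training loss.

```python
import math
import random

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a gate value in [0, 1] for one slot.

    Gates that are exactly 0 prune the slot. beta/gamma/zeta are the
    usual hard-concrete defaults; log_alpha is the gate's openness logit."""
    u = min(max(random.random(), 1e-6), 1 - 1e-6)   # noise in (0, 1)
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma               # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))                 # hard clip -> exact 0s and 1s

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable probability that the gate is non-zero;
    summed over slots, this forms the sparsity penalty."""
    return 1 / (1 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))
```

Because the stretched sample is hard-clipped, a strongly negative logit yields gates that are exactly zero, so the corresponding slot is removed from the decoder's attention rather than merely down-weighted.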

