MESSAGENET: MESSAGE CLASSIFICATION USING NATURAL LANGUAGE PROCESSING AND META-DATA

Abstract

In this paper we propose a new Deep Learning (DL) approach for message classification. Our method builds on state-of-the-art Natural Language Processing (NLP) building blocks, combined with a novel technique for infusing the meta-data inputs that are typically available in messages, such as sender information, timestamps, attached images, audio, affiliations, and more. As we demonstrate throughout the paper, going beyond the text alone by leveraging all available channels in a message can yield an improved representation and higher classification accuracy. To form the message representation, each type of input is processed in a dedicated block of the neural network architecture that is suited to that data type. This implementation enables training all blocks together simultaneously, forming cross-channel features within the network. We show in the Experiments Section that in some cases a message's meta-data holds additional information that cannot be extracted from the text alone, and that using this information yields better performance. Furthermore, we demonstrate that our multi-modality block approach outperforms other approaches for injecting the meta-data into the text classifier.

1. INTRODUCTION

Many real-world applications require message classification and regression, such as handling spam emails Karim et al. (2020), ticket routing Han et al. (2020), article sentiment review Medhat et al. (2014), and more. Accurate message classification could improve critical scenarios such as call centers (routing tickets based on topic) Han et al. (2020), alert systems (flagging highly important alert messages) Gupta et al. (2012), and categorizing incoming messages (automatically uncluttering emails) Karim et al. (2020); Klimt & Yang (2004). The main distinction between text and message classification is the availability of additional attributes, such as sender information, timestamps, attached images, audio, affiliations, and more. New message classification contests regularly appear on prominent platforms (e.g., Kaggle), showing how sought after this topic is. There are already many data-sets to explore in this field, but no clear winning algorithm that fits all scenarios with high accuracy, efficiency, and simplicity (in terms of implementation and interpretation).

A notable advancement in the field of NLP is the attention-based transformer architecture Vaswani et al. (2017). This family of methods excels at finding local connections between words and better understanding the meaning of a sentence. A leading example is the Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2018). Open-source implementations make such models accessible and easy to use, and provide pre-trained versions. In addition, one can use transfer learning Pan & Yang (2009) to further train BERT on one's own data, creating a model tailored to the specific task at hand. BERT, and often other transformer-based models, are designed to handle text. They operate on the words of a given text by encoding them into tokens, and through the connections between the tokens they learn the context of sentences.
This approach is limited, since sometimes more information, not necessarily textual, can be extracted and used. Throughout this paper we refer to this information as meta-data to distinguish it from the main stream of textual content (though one may recognize it as the core data, depending on the application). For example, meta-data could be the time stamp of when the text was written, sent, or published. Another example is the identity of the writer, when the corpus has a small set of writers. There have been some attempts to incorporate such information into BERT models, for example by assigning artificial tokens for writers or for temporal segments (a token per month, for example) Zhang et al. (2021). This approach is limited, since not all meta-data entries are suitable for encoding by tokenization. In the temporal-segments example, more segments introduce more tokens, leading to high computational cost, while fewer segments cause loss of information. Another approach is to concatenate the embeddings created by the transformer module with the outputs of an embedding module for the meta-data. In this approach, a transformer is trained (directly or via transfer learning) on the text, and separate modules (time-series embedding, sender embeddings, etc.) are used to embed the meta-data. All the embeddings are then concatenated and used as inputs to a classification network. A drawback of this approach is that the internal network features are not trained from a combination of different input streams, and therefore cross-dependent features cannot form (e.g., the importance of an email is determined not only by its content, but also by who sent it, when, to whom else, attachments, etc.). To bridge these gaps, we implement a transformer-based model that can train on both the text (transformer architecture) and the meta-data. We create a new architecture of a block-based network, in which each block handles a different kind of input.
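The trade-off of the artificial-token approach discussed above can be sketched concretely. The following is a minimal illustration, not taken from any cited implementation; the token format and helper names are assumptions made for the example.

```python
from datetime import datetime

def month_token(ts: datetime) -> str:
    """Map a timestamp to an artificial vocabulary token, e.g. '[TIME_2021_03]'."""
    return f"[TIME_{ts.year}_{ts.month:02d}]"

def extra_vocab_size(n_years: int, granularity: str) -> int:
    """How many artificial tokens a given temporal granularity adds to the vocabulary."""
    tokens_per_year = {"month": 12, "week": 52, "day": 365}[granularity]
    return n_years * tokens_per_year

print(month_token(datetime(2021, 3, 15)))  # [TIME_2021_03]
# Finer segments multiply the vocabulary (and the embedding table) ...
print(extra_vocab_size(5, "day"))    # 1825 extra tokens
# ... while coarser segments discard temporal detail.
print(extra_vocab_size(5, "month"))  # 60 extra tokens
```

The sketch makes the dilemma visible: day-level tokens over five years add 1825 embedding rows to train, while month-level tokens add only 60 but collapse all days of a month into one symbol.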
Splitting into blocks provides the flexibility to handle different kinds of input. We present results of the method with a main block, based on a transformer, that handles the text, and an additional block that handles the pre-processed meta-data inputs individually. This method can be extended to support more complex blocks, such as an advanced DL model for images Wang et al. (2017), a temporal-analysis block to extract information from temporal meta-data Ienco & Interdonato (2020), additional transformer blocks for multiple text inputs (for example, the subject and body of an email), categorical data, and more. To demonstrate the performance of the method we run multiple experiments on publicly available data-sets to show the advantages of the block architecture, and compare it to a transformer benchmark (BERT), a Random Forest (RF) classifier, and Multi-Layer Perceptron (MLP) networks. We achieve competitive results, and in most cases outperform those benchmarks, showcasing that there is much more to extract from the meta-data than from the text alone in classification tasks.
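The fused forward pass of such a block architecture can be sketched as follows. This is a minimal numpy illustration under assumed dimensions: random arrays stand in for the outputs of the text block (in the real model, a transformer such as BERT) and the meta-data block, and the shared head is a tiny two-layer MLP. None of the names or sizes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper).
TEXT_DIM, META_DIM, HIDDEN, N_CLASSES = 768, 16, 64, 4
BATCH = 8

# Stand-ins for the per-block outputs: the text block would be a transformer
# embedding, the meta-data block a small embedding network.
text_features = rng.normal(size=(BATCH, TEXT_DIM))
meta_features = rng.normal(size=(BATCH, META_DIM))

# Shared classification head. Because it sees the concatenation of both
# blocks, its weights receive gradients from both streams when trained
# end to end, which is what allows cross-channel features to form.
W1 = rng.normal(size=(TEXT_DIM + META_DIM, HIDDEN)) * 0.01
W2 = rng.normal(size=(HIDDEN, N_CLASSES)) * 0.01

fused = np.concatenate([text_features, meta_features], axis=1)  # (8, 784)
hidden = np.maximum(fused @ W1, 0.0)                            # ReLU
logits = hidden @ W2
print(logits.shape)  # (8, 4)
```

The contrast with the late-fusion baseline described earlier is where the training signal flows: here a single loss backpropagates through the shared head into both blocks simultaneously, rather than each embedding module being trained in isolation and only combined afterwards.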

2. RELATED WORK

Natural language processing tasks. The publication of BERT Devlin et al. (2018) was a turning point in the text classification domain. The authors demonstrated high accuracy on complicated tasks such as question answering, named entity recognition, and textual entailment Wang et al. (2019). Since then, many authors have investigated improved architectures and variations such as RoBERTa Liu et al. (2019), ALBERT Lan et al. (2019), DistilBERT Sanh et al. (2019), and more. Some focus on better performance on the benchmark tasks, and some create lighter versions of the model that reduce the computational demands while preserving competitive accuracy. Other propositions, like XLNet Yang et al. (2019) and GPT-3 Brown et al. (2020), introduce architectures that compete with BERT (also using transformers). Common benchmarks for these models are GLUE, SuperGLUE Wang et al. (2019), SQuAD 2.0 Rajpurkar et al. (2018), and more. Text classification is a less common benchmark, but the models can be used for this task, as shown in this paper.

Accessibility of transformers. Another factor contributing to the growing popularity of transformers is the variety of open-source code bases that make it easy for data scientists to experiment with different architectures and then use them in their applications. The Huggingface transformers package Wolf et al. (2020) is a Python library that can be used to train and fine-tune models, with a large variety of base models to choose from and a straightforward implementation. GPT-3 Brown et al. (2020) is accessible through a convenient application programming interface (API). We note that many libraries that do not use machine learning for text classification exist, such as NLTK Bird et al. (2009), spaCy Honnibal & Montani (2017), and more. These are also easily accessible and offer advanced NLP feature extraction and other text-analysis tools.

Text classification. There are many tasks in text classification, and each may be considered a field of study in its own right. A popular one is sentiment analysis, aiming to classify texts as positive or negative.

