INTEGRATING LINGUISTIC KNOWLEDGE INTO DNNS: APPLICATION TO ONLINE GROOMING DETECTION

Anonymous

Abstract

Online grooming (OG) of children is a pervasive issue in an increasingly interconnected world. We explore several complementary methods for incorporating Corpus Linguistics (CL) knowledge into accurate and interpretable Deep Learning (DL) models. These methods provide an implicit text normalisation that adapts embedding spaces to groomers' usage of language, and they focus the DNN's attention on the expressions of OG strategies. We apply these integrations to two architecture types and improve on the state of the art on a new OG corpus.

1. INTRODUCTION

Online grooming (OG) is a communicative process of entrapment in which an adult lures a minor into taking part in sexual activities online and, at times, offline (Lorenzo-Dus et al., 2016; Chiang & Grant, 2019). Our aim is to detect instances of OG. We achieve this through binary classification of whole conversations into OG (positive class) or neutral (negative class). This classification requires the ability to capture subtleties in the language used by groomers.

Corpus Linguistic (CL) analysis provides a detailed characterisation of language in large textual datasets (McEnery & Wilson, 2003; Sinclair, 1991). We argue that, when integrated into ML models, the products of CL analysis may allow better capture of language subtleties, while simplifying and guiding the learning task. We consider two types of CL products and explore strategies for their integration into several stages of DNNs. Moreover, we show that CL knowledge may help law enforcement interpret the ML decision process, towards the production of evidence for potential prosecution.

Our text heavily uses slang and sms-style writing, as do many real-world Natural Language Processing (NLP) tasks on chat logs. Text normalisation methods have been proposed to reduce variance in word choice and/or spelling and thus simplify learning, e.g. (Mansfield et al., 2019) for sms-style writing. However, they do not account for the final analysis goal and may discard some informative variance, e.g. the use of certain forms of slang possibly indicative of a user category. CL analysis identifies the preferred usage of spelling variants or synonyms. We propose to use this domain knowledge to selectively normalise chat logs while preserving the variance that is informative for the classification task.

As demonstrated by the CL analysis in (Lorenzo-Dus et al., 2016), the theme and immediate purpose of groomer messages may vary throughout the conversation, in order to achieve the overarching goal of entrapping the victim.
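To make the selective-normalisation idea concrete, the following sketch collapses sms-style spelling variants to canonical forms while preserving slang tokens flagged as class-informative. The lexicons (VARIANT_MAP, INFORMATIVE_SLANG) are hypothetical placeholders for the paper's CL-derived resources, which are not reproduced here:

```python
# Hypothetical mapping of sms-style spelling variants to canonical forms.
VARIANT_MAP = {"u": "you", "ur": "your", "gr8": "great", "thx": "thanks"}

# Hypothetical set of slang forms that CL analysis flags as informative
# for the classification task (e.g. possibly indicative of a user category).
INFORMATIVE_SLANG = {"gr8"}

def selective_normalise(tokens):
    """Normalise non-informative spelling variants; keep informative slang."""
    out = []
    for tok in tokens:
        if tok in INFORMATIVE_SLANG:
            out.append(tok)  # preserve: its surface form carries signal
        else:
            out.append(VARIANT_MAP.get(tok, tok))
    return out

print(selective_normalise(["u", "r", "gr8", "thx"]))
# → ['you', 'r', 'gr8', 'thanks']
```

In practice the same effect can be obtained implicitly, by adapting the embedding space rather than rewriting the tokens, as discussed in the contributions below.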
Groomers pursue a series of inter-connected "sub-goals", referred to here as OG processes: gaining the child's trust, planning activities, building a relationship, isolating them emotionally and physically from their support network, checking their level of compliance, introducing sexual content, and trying to secure an off-line meeting. The language used within these processes is not always sexually explicit, which makes their detection more challenging. However, CL analysis additionally flags some contexts associated with the OG processes, in the form of word collocations (i.e. words that occur within the same window of 7 words) that tend to occur more frequently in, and can therefore be associated with, OG processes. We propose to exploit the relations between the OG processes and their overarching goal of OG to improve the final OG classification, and we use the CL-identified context windows to guide the learning of our DNN.

Our main contributions are: 1) We explore different strategies for integrating CL knowledge into DNNs. They are applied to two architecture types and demonstrated on OG detection, but may generalise to other NLP applications that involve digital language and/or complex conversational strategies. 2) The principle and several implementations of selectively normalising text by modifying a word embedding in support of classification. 3) The decomposition of conversation
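One simple way to turn the CL-identified context windows into guidance for a DNN is to mark the tokens falling inside a 7-word collocation window around collocating terms, yielding a binary mask that could, for instance, weight the network's attention. The sketch below assumes a 7-token span centred on each seed term, and the seed set is hypothetical rather than taken from the actual CL analysis:

```python
SEEDS = {"meet"}  # hypothetical collocating terms flagged by CL analysis
WINDOW = 7        # collocation window size used in the CL analysis

def context_mask(tokens):
    """Mark tokens inside a WINDOW-token span centred on any seed term."""
    mask = [0] * len(tokens)
    half = WINDOW // 2
    for i, tok in enumerate(tokens):
        if tok in SEEDS:
            lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
            for j in range(lo, hi):
                mask[j] = 1  # token lies in an OG-process context window
    return mask

tokens = "hi there can we meet after school today".split()
print(context_mask(tokens))
# → [0, 1, 1, 1, 1, 1, 1, 1]
```

Such a mask leaves the model free to learn from the full conversation while biasing it towards the spans that CL analysis associates with OG processes.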

