Native Language Identification (NLI) is the NLP task of detecting an anonymous writer's native or first language (L1) from the characteristics of their writing in their second (L2) or further language. The task has recently attracted considerable attention, including two recent shared tasks on NLI.
Most previous research on NLI (including the shared tasks) has focused on identifying the L1 in language-learning contexts and has relied exclusively on learner corpora (e.g., the CLC). Such corpora contain essays written by learners of English as an L2; the essays typically address a predefined set of topics and are of reasonable, comparable length, giving researchers a solid basis for estimating L1s. These traits make the resulting models invaluable for applications in language-learning contexts, such as writing improvement, L1-specific feedback, and improved error detection and correction.
At the same time, English is the predominant language of the Web, but it is not the L1 of most authors. Author profiling on the Web is useful for a number of applications: for instance, L1 detection can help identify fake news, opinion spam, and other fraud attempts by revealing discrepancies between an author's purported and detected L1. While previous NLI studies that relied on learner data may provide useful clues and techniques, performance results obtained in language-learning contexts may not be reliable estimates of performance on social media texts and user reviews.
This project aims to take NLI studies "into the wild" and investigate the challenges of NLI on real-world data. In particular: (i) Previous research considered a number of grammatical and syntactic features of text. Social media data is challenging for standard NLP tools (Eisenstein, 2013; Kong et al., 2014), and even when grammatical and syntactic analysis is available, the properties of social media data outlined above suggest that grammatical and syntactic features derived from learner training data may not generalise to it. (ii) Another promising line of research relied on errors and misspellings in L2 writing (Kochmar, 2011). Such features are not readily available for social media texts; moreover, tweets typically contain non-conventionally spelled words and incomplete sentences, making it harder to distinguish genuine misspellings from deliberate variation (Han & Baldwin, 2011).
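One feature family commonly used in NLI that sidesteps both problems above is character n-grams: they require no parsing or spell-checking, so they remain usable on noisy social media text. A minimal sketch of such a feature extractor (the example input is invented for illustration):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams over the lowercased text.

    A surface-level feature that needs no grammatical analysis,
    so it degrades gracefully on noisy, non-standard input.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Works equally on clean essay text and on tweet-like input:
profile = char_ngrams("Makn sens of twitter spellin")
```

Each document is thus reduced to a bag of overlapping character sequences, which capture orthographic and morphological habits without committing to any sentence-level analysis.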
This project will use a dataset of user reviews extracted from the Trustpilot website and the Twitter social media platform. The texts are written in English as an L2 by speakers of seven different L1s. As a starting point, students undertaking this project may consider: (i) generalisation experiments applying models trained in the learner context to social media data (a baseline); (ii) feature engineering tailored to social media data (a more linguistically informed approach); and (iii) NLI experiments using domain adaptation (a more machine-learning-oriented approach).
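The baseline generalisation experiment in (i) can be sketched end to end with a deliberately simple classifier: build one character-n-gram profile per L1 class from learner-style text, then assign a social-media-style test text to the class whose profile it is most similar to. The training sentences, labels, and test input below are invented toy data, not drawn from the Trustpilot/Twitter dataset, and the nearest-centroid classifier is only a stand-in for a proper model:

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram counts over the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def train(docs_by_l1, n=3):
    """One summed n-gram profile (centroid) per L1 class."""
    return {l1: sum((ngrams(d, n) for d in docs), Counter())
            for l1, docs in docs_by_l1.items()}

def predict(text, centroids, n=3):
    """Pick the L1 whose profile is closest to the input text."""
    feats = ngrams(text, n)
    return max(centroids, key=lambda l1: cosine(feats, centroids[l1]))

# Toy "learner-context" training data (invented examples of typical
# L1-influenced English, for illustration only):
learner = {
    "de": ["I have since three years English learned in the school."],
    "fr": ["I am agree that the informations are very interessant."],
}
centroids = train(learner)

# Toy "social-media-style" test input with noisy punctuation:
label = predict("i am agree, the informations here is interessant!!",
                centroids)
```

The test sentence shares many character n-grams with the "fr" training example, so the baseline attributes it to that class; the project's real experiments would quantify how far such learner-trained models degrade on Trustpilot reviews and tweets.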
Trustpilot and Twitter datasets will be provided.
Eisenstein (2013), What to do about bad language on the internet
Kong et al. (2014), A Dependency Parser for Tweets
Kochmar (2011), Identification of a Writer’s Native Language by Error Analysis
Han & Baldwin (2011), Lexical Normalisation of Short Text Messages: Makn Sens a #twitter