Computer Laboratory

Diarmuid Ó Séaghdha

MPhil Project Suggestions 2011-12

Modelling the language of politics

The use of language processing methods is currently a "hot topic" in political science (Grimmer and Stewart 2011). Computational analysis is frequently used to model the language of political discourse, both among professional politicians and among members of the public. For example, Fader et al. (2007) use lexical similarity and graph techniques to estimate the influence of various members of the US Senate, Monroe et al. (2008) investigate which lexical features distinguish politicians of opposing parties, and Gerrish and Blei (2011) correlate the language used by politicians with their voting behaviour. Pennacchiotti and Popescu (2011), by contrast, focus on users of Twitter, showing that it is possible to predict political affiliation from microblogging and social activity. There are many opportunities for NLP researchers to contribute in this area; some possibilities that could lead to a very nice project are:

  1. Modelling political language in the UK: almost all work in political textual analysis to date has focused on the US political landscape. As well as being of local interest, analogous studies of UK politics would facilitate comparison of findings across political systems. For example, what happens in a three-party system?
  2. Identifying controversial or polarising topics: while previous research has modelled how political opponents differ in simple lexical choice, the idea here would be to model how opponents discuss the same issue in different terms or with different sentiment. A possible extension would be to investigate how political polarisation in social media (An et al., 2011) differs across topics.

In all cases, data could be collected from online parliamentary proceedings (for the UK, Hansard Online) or from Twitter profiles of politicians and politically engaged individuals. The NLP techniques involved may be based on lexical associations, supervised learning or topic modelling.
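
As a rough illustration of the "lexical associations" option, the sketch below scores each word by a smoothed log-odds ratio between two collections of speeches. It is a deliberately simple baseline and not the method of any of the papers cited above; the function name, the smoothing constant alpha and the toy token lists are all assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_ratio(tokens_a, tokens_b, alpha=0.5):
    """Return a smoothed log-odds ratio for every word in two token lists.

    Positive scores mark words relatively more frequent in corpus A,
    negative scores mark words relatively more frequent in corpus B.
    alpha is an add-alpha smoothing constant (an assumption, not tuned).
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + alpha) / total_a
        p_b = (counts_b[word] + alpha) / total_b
        scores[word] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Invented toy data standing in for tokenised speeches by two parties.
party_a = "tax cut growth tax business".split()
party_b = "health care tax welfare care".split()

scores = log_odds_ratio(party_a, party_b)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With real data the token lists would come from the collected parliamentary proceedings or tweets, and a project would probably want a more careful estimator, for example one that accounts for the variance of rare words.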

Identifying non-compositional compound nouns

Compounding, by which new lexical items are created by combining words, is a frequent process in many languages. In English, the most common form of compounding is noun-noun compounding; compound nouns such as "olive oil", "tax cut" and "taxi driver" are very common, and novel compounds are continually added to the language. Interpreting the relational semantics of compounds is a difficult problem that has received much attention in NLP (e.g., Ó Séaghdha 2008, Tratz and Hovy 2010); this project will focus on a separate but related problem, that of identifying compounds with non-compositional meaning. The compound examples given above are all compositional, in the sense that their meanings are relatively predictable given knowledge of their constituents' meanings. On the other hand, the meanings of non-compositional examples like "zebra crossing" and "crocodile tears" are not predictable from knowledge about zebras and crocodiles; one has to know the specific meaning of each compound. Compositionality detection is also a well-studied problem (Baldwin et al. 2003, Cook et al. 2009) and was the subject of a shared task this year (Biemann and Giesbrecht 2011), but research in this area has mostly focused on other forms of multiword expressions such as verb-particle constructions. Reddy et al. (2011) have recently released a large dataset of compounds with fine-grained compositionality judgements; a second compositionality dataset could be extracted from the general compound data collected by Ó Séaghdha (2008). This project would build on Reddy et al.'s approach, borrowing ideas from work on discriminative learning of compositionality (Bergsma et al. 2010) and from paraphrase-based modelling of compound semantics (Nakov 2007).
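
To make the task concrete, one simple distributional baseline is to compare the context vector of a compound with a vector composed from its constituents: if the two are similar, the compound is likely to be compositional. The sketch below uses additive composition and cosine similarity. It is only an illustration under stated assumptions and is not presented as the approach of any cited paper; the function names and the tiny co-occurrence vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality_score(compound_vec, modifier_vec, head_vec):
    """Similarity between the observed compound vector and the sum of
    the modifier and head vectors (additive composition)."""
    composed = [m + h for m, h in zip(modifier_vec, head_vec)]
    return cosine(compound_vec, composed)


# Invented co-occurrence counts over four hypothetical context
# dimensions: (food, money, animal, road).
vectors = {
    "olive": [5, 0, 0, 0],
    "oil": [4, 1, 0, 0],
    "olive oil": [6, 1, 0, 0],
    "zebra": [0, 0, 6, 0],
    "crossing": [0, 0, 1, 5],
    "zebra crossing": [0, 0, 0, 6],
}

olive_oil = compositionality_score(
    vectors["olive oil"], vectors["olive"], vectors["oil"])
zebra_crossing = compositionality_score(
    vectors["zebra crossing"], vectors["zebra"], vectors["crossing"])
print(olive_oil, zebra_crossing)
```

On these toy vectors "olive oil" scores higher than "zebra crossing", which matches the intuition in the text; real vectors would be estimated from a large corpus and the scores evaluated against human compositionality judgements.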

Multilingual selectional preference learning

Selectional preference learning is a form of lexical acquisition where the aim is to model typical relationships between predicates and arguments; for example, the direct object slot of the verb "eat" has a preference for words like "pizza" and "stew" (the semantic class of foodstuffs), while the subject slot of "eat" has a preference for animate beings such as "man" and "dog". Selectional preference learning has a long history in NLP; recently, probabilistic topic models have been proposed as a powerful modelling framework, giving state-of-the-art results on a variety of tasks (Ó Séaghdha 2010, Ó Séaghdha and Korhonen 2011). The goal of this project is to extend prior work to perform joint modelling of selectional preferences in multiple languages, with the optimistic goal of improving overall model quality and the realistic goal of compensating for sparsity in languages where data are scarcer than in English. There is a large body of work on multilingual modelling for tasks such as part-of-speech learning (for example Snyder et al., 2008); prior work on multilingual semantic modelling is less common, but Peirsman and Padó (2010) have shown that multilingual selectional preference learning is possible. Concretely, the project would build multilingual topic models for preference learning, possibly using frameworks similar to Mimno et al. (2009) or Boyd-Graber and Blei (2009). In addition to intrinsic measurements of model quality, it may be interesting to evaluate using the dataset from the SemEval-2 Cross-Lingual Lexical Substitution Task (Mihalcea et al., 2010). This project should be of interest to students who wish to develop new machine learning models for application to cutting-edge problems in lexical semantics.
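
For readers new to the task, the sketch below estimates selectional preferences in the simplest possible way, as smoothed relative frequencies P(noun | verb, slot) computed from observed (verb, slot, noun) triples. This is a count-based baseline for illustration only, not the topic-model approach the project would develop; the slot labels "dobj" and "nsubj", the smoothing constant and the toy triples are assumptions.

```python
from collections import Counter, defaultdict


def estimate_preferences(triples, alpha=0.1):
    """Build a function giving an add-alpha smoothed P(noun | verb, slot).

    triples is an iterable of (verb, slot, noun) observations, e.g.
    extracted from a parsed corpus. alpha is a smoothing constant
    (an assumption, not tuned).
    """
    counts = defaultdict(Counter)
    nouns = set()
    for verb, slot, noun in triples:
        counts[(verb, slot)][noun] += 1
        nouns.add(noun)

    def prob(verb, slot, noun):
        slot_counts = counts.get((verb, slot), Counter())
        total = sum(slot_counts.values()) + alpha * len(nouns)
        return (slot_counts[noun] + alpha) / total

    return prob


# Invented toy observations echoing the "eat" example in the text.
triples = [
    ("eat", "dobj", "pizza"),
    ("eat", "dobj", "stew"),
    ("eat", "dobj", "pizza"),
    ("eat", "nsubj", "man"),
    ("eat", "nsubj", "dog"),
]

prob = estimate_preferences(triples)
print(prob("eat", "dobj", "pizza"), prob("eat", "dobj", "man"))
```

A baseline like this assigns only the smoothing mass to unseen but plausible arguments, which is exactly the sparsity problem that class-based and topic-model approaches, and potentially multilingual data, are meant to address.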

The Language of Power and Influence on Twitter (with Daniele Quercia)

While macro-level studies of Twitter have concluded that "Twitter is not a social network but a news media" (Kwak et al., 2010), micro-level analyses of users and user-user interactions suggest that the Twittersphere functions as a set of real communities according to well-established sociological definitions of community (Cheng et al., 2011, Quercia et al., 2011). This project would study social power relationships on Twitter, with a focus on linguistic indicators. Notions of influence on Twitter have been studied before (Liu et al., 2010) but there are still many potential aspects to investigate. Recently, Bramsen et al. (2011) looked at identifying power hierarchies as manifested in email data; here the sociolinguistic motivation would be similar but the problem would be quite different (and novel!). The NLP techniques involved would start with basic lexical and topic-model analysis, but could extend to sophisticated dialogue modelling (Ritter et al., 2010).

Linguistic Analysis for Computational Social Science (with Daniele Quercia)

"Computational social science" (Lazer et al., 2009) is a new discipline that aims at using large archives of naturalistically-created behavioural data (of, for example, emails, tweets, Facebook contacts) to answer social science questions. Collecting data on actual behaviour is seen by many as the gold standard of social science, and digital archives offer the possibility of doing so. Emails have been used to understand how corporations work (Kleinbaum et al., 2008), and online dating sites to understand racial preferences in dating (Feliciano et al., 2009). However, in using web services, one faces few challenges, and this project is about one specific challenge: how to use Twitter to understand people beliefs about a variety of societal issues. These issues include privacy (Nippert-Eng 2010), serendipity (Pariser 2010), gender issues, and nationalism. One way of interpreting what people say on Twitter is to analyse the use of language on tweets. The approach taken in this project will be to perform automatic analysis of issue-specific Twitter text using NLP and machine learning techniques including topic modelling (Blei et al., 2003), statistical association measures (Evert 2005) and supervised prediction. The results of this analysis will be compared to those obtained by previous social science studies using a grounded theory approach (Rubin et al., 2011).