Diarmuid Ó Séaghdha
MPhil Project Suggestions 2011-12
Modelling the language of politics
The use of language processing methods is currently a "hot topic" in political science (Grimmer and Stewart 2011). Computational analysis is frequently used to model the language of political discourse, both among professional politicians and among members of the public. For example, Fader et al. (2007) use lexical similarity and graph techniques to estimate the influence of various members of the US Senate, Monroe et al. (2008) investigate which lexical features distinguish politicians of opposing parties, and Gerrish and Blei (2011) correlate the language used by politicians with their voting behaviour. On the other hand, Pennacchiotti and Popescu (2011) focus on users of Twitter, showing that it is possible to predict political affiliation from microblogging and social activity. There are many opportunities for NLP researchers to contribute in this area; some possibilities that could lead to a very nice project are:
- Modelling political language in the UK: almost all work in political textual analysis to date has focused on the US political landscape. As well as being of local interest, analogous studies of UK politics would facilitate comparison of findings across political systems. For example, what happens in a three-party system?
- Identifying controversial or polarising topics: while previous research has modelled how political opponents differ in simple lexical choice, the idea here would be to model how opponents discuss the same issue in different terms or with different sentiment. A possible extension would be to investigate how political polarisation in social media (An et al., 2011) differs across topics.
In all cases, data could be collected from online parliamentary proceedings (for the UK, Hansard Online) or from the Twitter profiles of politicians and politically engaged individuals. The NLP techniques involved may be based on lexical associations, supervised learning or topic modelling.
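To make the lexical-association route concrete, the sketch below computes smoothed log-odds ratios in the spirit of Monroe et al. (2008), identifying words that distinguish two groups of speakers. It is a simplified version of their measure (without the variance normalisation they apply), and the word counts here are invented purely for illustration:

```python
import math
from collections import Counter

def log_odds_ratio(counts_a, counts_b, alpha=0.01):
    """Smoothed log-odds ratio of each word's usage in corpus A vs. corpus B.
    Positive scores mark words favoured by group A, negative by group B."""
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for w in vocab:
        # Add-alpha smoothing so unseen words do not produce log(0)
        fa = counts_a.get(w, 0) + alpha
        fb = counts_b.get(w, 0) + alpha
        scores[w] = (math.log(fa / (n_a + alpha * len(vocab) - fa))
                     - math.log(fb / (n_b + alpha * len(vocab) - fb)))
    return scores

# Invented toy word counts from two sets of speeches
gov = Counter("growth deficit reform growth jobs".split())
opp = Counter("cuts unfair deficit cuts services".split())
scores = log_odds_ratio(gov, opp)
distinctive = sorted(scores, key=scores.get, reverse=True)
```

On real Hansard or Twitter data one would rank the full vocabulary this way and inspect the extremes of the list for party-distinctive vocabulary.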
- Jisun An, Meeyoung Cha, Krishna Gummadi and Jon Crowcroft. 2011. Media landscape in Twitter: A world of new conventions and political diversity. In Proceedings of ICWSM 2011.
- Anthony Fader, Dragomir Radev, Michael H. Crespin, Burt L. Monroe, Kevin M. Quinn and Michael Colaresi. 2007. MavenRank: Identifying Influential Members of the US Senate Using Lexical Centrality. In Proceedings of EMNLP 2007.
- Sean Gerrish and David Blei. 2011. Predicting Legislative Roll Calls from Text. In Proceedings of ICML 2011.
- Justin Grimmer and Brandon M. Stewart. 2011. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
- Burt L. Monroe, Michael P. Colaresi and Kevin M. Quinn. 2008. Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis 16(4): 372-403.
- Marco Pennacchiotti and Ana-Maria Popescu. 2011. Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter. In Proceedings of KDD 2011.
Identifying non-compositional compound nouns
Compounding, by which new lexical items are created by combining words, is a productive process in many languages. In English, the most common form of compounding is noun-noun compounding; compound nouns such as "olive oil", "tax cut" and "taxi driver" are very frequent, and novel compounds are continually added to the language. Interpreting the relational semantics of compounds is a difficult problem that has received much attention in NLP (e.g., Ó Séaghdha 2008, Tratz and Hovy 2010); this project will focus on a separate but related problem, that of identifying compounds with non-compositional meaning. The compound examples given above are all compositional, in the sense that their meanings are relatively predictable given knowledge of their constituents' meanings. On the other hand, non-compositional examples like "zebra crossing" and "crocodile tears" are not predictable from knowledge about zebras and crocodiles; one has to know the specific meaning of each compound. Compositionality detection is also a well-studied problem (Baldwin et al. 2003, Cook et al. 2009) and was the subject of a shared task this year (Biemann and Giesbrecht 2011), but research in this area has mostly focused on other forms of multiword expressions, such as verb-particle constructions. Reddy et al. (2011) have recently released a large dataset of compounds with fine-grained compositionality judgements; a second compositionality dataset could be extracted from the general compound data collected by Ó Séaghdha (2008). This project would build on Reddy et al.'s approach, borrowing ideas from work on discriminative learning of compositionality (Bergsma et al. 2010) and from paraphrase-based modelling of compound semantics (Nakov 2007).
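As an illustration of the kind of distributional approach involved, the sketch below scores a compound's compositionality by comparing its observed co-occurrence vector against an additive composition of its constituents' vectors: low similarity suggests a non-compositional reading. The co-occurrence counts are invented toy values; real experiments would extract them from a corpus and compare composition functions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda x: math.sqrt(sum(t * t for t in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def additive_composition(vec1, vec2):
    """Compose constituent vectors by addition (one simple composition model)."""
    out = dict(vec1)
    for k, v in vec2.items():
        out[k] = out.get(k, 0) + v
    return out

# Invented toy co-occurrence vectors
olive = {"tree": 3, "oil": 5, "green": 2}
oil = {"cook": 4, "olive": 5, "engine": 2}
olive_oil = {"cook": 6, "oil": 2, "tree": 1}    # observed compound vector

zebra = {"stripe": 5, "animal": 4, "africa": 3}
crossing = {"road": 4, "pedestrian": 3}
zebra_crossing = {"road": 5, "pedestrian": 4}   # idiomatic: road sense only

comp_score = cosine(olive_oil, additive_composition(olive, oil))
idiom_score = cosine(zebra_crossing, additive_composition(zebra, crossing))
```

With these toy counts the compositional "olive oil" scores higher than the non-compositional "zebra crossing"; a threshold or regression on such scores gives a first baseline against the Reddy et al. judgements.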
- Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An Empirical Model of Multiword Expression Decomposability. In Proceedings of the ACL-03 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment.
- Shane Bergsma, Aditya Bhargava, Hua He and Grzegorz Kondrak. 2010. Predicting the Semantic Compositionality of Prefix Verbs. In Proceedings of EMNLP 2010.
- Chris Biemann and Eugenie Giesbrecht. 2011. Distributional Semantics and Compositionality 2011: Shared Task Description and Results. In Proceedings of the ACL-11 Workshop on Distributional Semantics and Compositionality.
- Paul Cook, Afsaneh Fazly and Suzanne Stevenson. 2009. Unsupervised Type and Token Identification of Idiomatic Expressions. Computational Linguistics 35(1):61-103.
- Preslav Nakov. 2007. Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. PhD Thesis, University of California at Berkeley.
- Diarmuid Ó Séaghdha. 2008. Learning Compound Noun Semantics. PhD Thesis, University of Cambridge.
- Siva Reddy, Diana McCarthy and Suresh Manandhar. 2011. An Empirical Study on Compositionality in Compound Nouns. In Proceedings of IJCNLP 2011.
- Stephen Tratz and Eduard Hovy. 2010. A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation. In Proceedings of ACL 2010.
Multilingual selectional preference learning
Selectional preference learning is a form of lexical acquisition where the aim is to model typical relationships between predicates and arguments; for example, the direct object slot of the verb "eat" has a preference for words like "pizza" and "stew" (the semantic class of foodstuffs), while the subject slot of "eat" has a preference for animate beings such as "man" and "dog". Selectional preference learning has a long history in NLP; recently, probabilistic topic models have been proposed as a powerful modelling framework, giving state-of-the-art results on a variety of tasks (Ó Séaghdha 2010, Ó Séaghdha and Korhonen 2011). The goal of this project is to extend prior work to perform joint modelling of selectional preferences in multiple languages, with the optimistic goal of improving overall model quality and the realistic goal of compensating for data sparsity in languages where data is sparser than in English. There is a large body of work on multilingual modelling for tasks such as part-of-speech learning (for example Snyder et al., 2008); work on multilingual semantic modelling is less common, but Peirsman and Padó (2010) have shown that multilingual selectional preference learning is possible. Concretely, the project would build multilingual topic models for preference learning, possibly using frameworks similar to Mimno et al. (2009) or Boyd-Graber and Blei (2009). In addition to intrinsic measurements of model quality, it may be interesting to evaluate on the dataset from the SemEval-2 Cross-Lingual Lexical Substitution Task (Mihalcea et al., 2010). This project should be of interest to students who wish to develop new machine learning models for application to cutting-edge problems in lexical semantics.
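As background for the topic-modelling framework, the following self-contained sketch implements collapsed Gibbs sampling for a vanilla (monolingual) LDA model, treating each predicate slot as a "document" whose "words" are its observed argument heads; a multilingual extension along the lines of Mimno et al. (2009) would tie such models together across languages. The slot data is invented for illustration:

```python
import random

def gibbs_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for a vanilla LDA model. Each 'document'
    is the list of argument head words observed in one predicate slot."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]      # slot-topic counts
    nkw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                       # topic totals
    z = []
    for d, doc in enumerate(docs):            # random initialisation
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                # Remove this token's assignment, then resample it from
                # the conditional distribution over topics
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                k = n_topics - 1
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return ndk, nkw, vocab

# Invented toy predicate slots and their observed arguments
slots = [["pizza", "stew", "bread", "pizza"],   # eat-dobj
         ["beer", "water", "wine"],             # drink-dobj
         ["man", "dog", "woman"]]               # eat-subj
ndk, nkw, vocab = gibbs_lda(slots, n_topics=2)
```

After sampling, each slot's topic distribution (rows of ndk) acts as its learned preference profile, and each topic's word distribution (rows of nkw) plays the role of an induced semantic class.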
- Jordan Boyd-Graber and David Blei. 2009. Multilingual Topic Models for Unaligned Text. In Proceedings of UAI 2009.
- Rada Mihalcea, Ravi Sinha and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-lingual lexical substitution. In Proceedings of SemEval 2010.
- David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith and Andrew McCallum. 2009. Polylingual Topic Models. In Proceedings of EMNLP 2009.
- Diarmuid Ó Séaghdha. 2010. Latent variable models of selectional preference. In Proceedings of ACL 2010.
- Diarmuid Ó Séaghdha and Anna Korhonen. 2011. Probabilistic models of similarity in syntactic context. In Proceedings of EMNLP 2011.
- Yves Peirsman and Sebastian Padó. 2010. Cross-lingual Induction of Selectional Preferences with Bilingual Vector Spaces. In Proceedings of NAACL 2010.
- Benjamin Snyder, Tahira Naseem, Jacob Eisenstein and Regina Barzilay. 2008. Unsupervised Multilingual Learning for POS Tagging. In Proceedings of EMNLP 2008.
The Language of Power and Influence on Twitter (with Daniele Quercia)
While macro-level studies of Twitter have concluded that "Twitter is not a social network but a news media" (Kwak et al., 2010), micro-level analyses of users and user-user interactions suggest that the Twittersphere functions as a set of real communities according to well-established sociological definitions of community (Cheng et al., 2011, Quercia et al., 2011). This project would study social power relationships on Twitter, with a focus on linguistic indicators. Notions of influence on Twitter have been studied before (Liu et al., 2010) but there are still many potential aspects to investigate. Recently, Bramsen et al. (2011) looked at identifying power hierarchies as manifested in email data; here the sociolinguistic motivation would be similar but the problem would be quite different (and novel!). The NLP techniques involved would start with basic lexical and topic-model analysis, but could extend to sophisticated dialogue modelling (Ritter et al., 2010).
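As a starting point for the basic lexical analysis, one could frame power prediction as supervised text classification. The sketch below trains a multinomial Naive Bayes classifier on bag-of-words features over invented toy messages labelled with the sender's relative status; Bramsen et al.'s actual features and data are considerably richer, so this is only a baseline shape for the task:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Multinomial Naive Bayes: collect class and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for toks, label in examples:
        class_counts[label] += 1
        for w in toks:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(toks, model):
    """Return the class with highest add-one-smoothed log probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in toks:
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Invented toy messages labelled by the sender's relative status
train = [
    ("please send me the report by five".split(), "superior"),
    ("need this done today".split(), "superior"),
    ("sorry to bother you would it be possible".split(), "subordinate"),
    ("apologies if this is a silly question".split(), "subordinate"),
]
model = train_nb(train)
pred = classify("would it be possible to meet".split(), model)
```

The hedging vocabulary ("would it be possible") pushes the test message towards the subordinate class; on real Twitter interactions one would add features beyond unigrams, such as reply structure or politeness markers.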
- Philip Bramsen, Martha Escobar-Molano, Ami Patel and Rafael Alonso. 2011. Extracting Social Power Relationships from Natural Language. In Proceedings of ACL 2011.
- Justin Cheng, Daniel Romero, Brendan Meeder and Jon Kleinberg. 2011. Predicting Reciprocity in Social Networks. In Proceedings of SocialCom 2011.
- Haewoon Kwak, Changhyun Lee, Hosung Park and Sue Moon. 2010. What is Twitter, a Social Network or a News Media? In Proceedings of WWW 2010.
- Lu Liu, Jie Tang, Jiawei Han, Meng Jiang and Shiqiang Yang. 2010. Mining Topic-level Influence in Heterogeneous Networks. In Proceedings of CIKM 2010.
- Daniele Quercia, Jonathan Ellis, Licia Capra and Jon Crowcroft. 2011. In the Mood for Being Influential on Twitter. In Proceedings of SocialCom 2011.
- Alan Ritter, Colin Cherry and Bill Dolan. 2010. Unsupervised Modeling of Twitter Conversations. In Proceedings of NAACL 2010.
Linguistic Analysis for Computational Social Science (with Daniele Quercia)
"Computational social science" (Lazer et al., 2009) is a new discipline that aims at using large archives of naturalistically-created behavioural data (of, for example, emails, tweets, Facebook contacts) to answer social science questions. Collecting data on actual behaviour is seen by many as the gold standard of social science, and digital archives offer the possibility of doing so. Emails have been used to understand how corporations work (Kleinbaum et al., 2008), and online dating sites to understand racial preferences in dating (Feliciano et al., 2009). However, in using web services, one faces few challenges, and this project is about one specific challenge: how to use Twitter to understand people beliefs about a variety of societal issues. These issues include privacy (Nippert-Eng 2010), serendipity (Pariser 2010), gender issues, and nationalism. One way of interpreting what people say on Twitter is to analyse the use of language on tweets. The approach taken in this project will be to perform automatic analysis of issue-specific Twitter text using NLP and machine learning techniques including topic modelling (Blei et al., 2003), statistical association measures (Evert 2005) and supervised prediction. The results of this analysis will be compared to those obtained by previous social science studies using a grounded theory approach (Rubin et al., 2011).
- David M. Blei, Andrew Y. Ng and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3:993-1022.
- Stefan Evert. 2005. The Statistics of Word Co-occurrences: Word Pairs and Collocations. PhD Thesis, University of Stuttgart.
- Cynthia Feliciano, Belinda Robnett and Golnaz Komaie. 2009. Gendered racial exclusion among white internet daters. Social Science Research 38(1):39-54.
- Adam M. Kleinbaum, Toby E. Stuart and Michael L. Tushman. 2008. Communication (and Coordination?) in a Modern, Complex Organization. Harvard Business School Working Paper.
- David Lazer et al. 2009. Computational Social Science. Science 323(5915): 721-723.
- Christena E. Nippert-Eng. 2010. Islands of Privacy. University of Chicago Press.
- Eli Pariser. 2011. The Filter Bubble: What the Internet Is Hiding from You. Penguin.
- Victoria L. Rubin, Jacquelyn Burkell and Anabel Quan-Haase. 2011. Facets of serendipity in everyday chance encounters: a grounded theory approach to blog analysis. Information Research 16(3).
- Joshua R. Tyler, Dennis M. Wilkinson and Bernardo A. Huberman. 2005. E-Mail as Spectroscopy: Automated Discovery of Community Structure within Organizations. The Information Society 21(2):143-153.