Computer Laboratory

Diarmuid Ó Séaghdha

MPhil Project Suggestions 2012-13

Update 23/11/12: I am not taking any more students for this year - sorry!


The Language of Influence in Social Media (with Daniele Quercia)

Update: This project is now taken

While macro-level studies of Twitter have concluded that "Twitter is not a social network but a news media" (Kwak et al., 2010), micro-level analyses of users and user-user interactions suggest that the Twittersphere functions as a set of real communities according to well-established sociological definitions of community (Cheng et al., 2011; Quercia et al., 2011). Twitter facilitates conversations between users, each of whom has a particular degree of status online (which may or may not reflect real-life status). This project would study social power relationships on Twitter, with a focus on linguistic indicators. Notions of influence on Twitter have been studied before (Liu et al., 2010; Danescu-Niculescu-Mizil et al., 2011), but there are still many potential aspects to investigate. Recently, Bramsen et al. (2011) looked at identifying power hierarchies as manifested in email data; here the sociolinguistic motivation would be similar but the problem would be quite different (and novel!). The NLP techniques involved would start with basic lexical, topic-model and sentiment analysis, but could extend to sophisticated dialogue act modelling (Ritter et al., 2010) and discourse effects such as accommodation (Danescu-Niculescu-Mizil et al., 2011; Danescu-Niculescu-Mizil et al., 2012) and coherence.


Multilingual models for lexical semantics

Lexical semantics is the study of word meaning; NLP research in this area typically deals with learning about concepts and how they interact through statistical analysis of text. The goal of this project is to use multilingual models that learn semantics from multiple languages at once, exploiting regularities that cross languages. One potential area of research for the project is selectional preference learning.

Selectional preference learning is a form of lexical acquisition where the aim is to model typical relationships between predicates and arguments; for example, the direct object slot of the verb "eat" has a preference for words like "pizza" and "stew" (the semantic class of foodstuffs), while the subject slot of "eat" has a preference for animate beings such as "man" and "dog". Selectional preference learning has a long history in NLP; recently, probabilistic topic models have been proposed as a powerful modelling framework, giving state-of-the-art results on a variety of tasks (Ó Séaghdha, 2010; Ó Séaghdha and Korhonen, 2011). This project would extend prior work to perform joint modelling of selectional preferences in multiple languages, with the optimistic goal of improving overall model quality and the realistic goal of compensating in languages where data may be more sparse than in English. There is a large body of work on multilingual modelling for tasks such as part-of-speech learning (for example Snyder et al., 2008); prior work on multilingual semantic modelling is less common, but Peirsman and Padó (2010) have shown that multilingual selectional preference learning is possible. The concrete aim would be to build multilingual topic models for preference learning, possibly using frameworks similar to Mimno et al. (2009) or Boyd-Graber and Blei (2009). In addition to intrinsic measurements of model quality, it may be interesting to evaluate using the dataset from the SemEval-2 Cross-Lingual Lexical Substitution Task (Mihalcea et al., 2010). This project should be of interest to students who wish to develop new machine learning models for application to cutting-edge problems in lexical semantics.
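
To make the notion of a selectional preference concrete, the following is a minimal sketch of the simplest possible estimator: a smoothed relative frequency P(argument | verb, slot) computed from observed (verb, slot, argument) triples. It is not the topic-model approach cited above, only a count-based baseline for illustration. The toy observations, the slot labels "dobj" and "nsubj", the function name preference_model and the smoothing constant alpha are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy (verb, slot, argument) observations; purely illustrative, not corpus data.
observations = [
    ("eat", "dobj", "pizza"),
    ("eat", "dobj", "pizza"),
    ("eat", "dobj", "stew"),
    ("eat", "nsubj", "man"),
    ("eat", "nsubj", "dog"),
    ("eat", "nsubj", "man"),
]


def preference_model(triples, vocabulary, alpha=0.1):
    """Build an add-alpha smoothed estimate of P(argument | verb, slot)."""
    counts = defaultdict(Counter)
    for verb, slot, arg in triples:
        counts[(verb, slot)][arg] += 1
    vocab_size = len(vocabulary)

    def prob(verb, slot, arg):
        slot_counts = counts[(verb, slot)]
        total = sum(slot_counts.values())
        # Add-alpha smoothing keeps unseen arguments from getting probability zero.
        return (slot_counts[arg] + alpha) / (total + alpha * vocab_size)

    return prob


vocabulary = {arg for _, _, arg in observations}
prob = preference_model(observations, vocabulary)

# The object slot of "eat" should prefer foodstuffs over animate beings.
print(prob("eat", "dobj", "pizza") > prob("eat", "dobj", "man"))
print(prob("eat", "nsubj", "man") > prob("eat", "nsubj", "stew"))
```

A count-based estimator like this cannot generalise from "pizza" to an unseen foodstuff, which is one motivation for the class-based and topic-model approaches mentioned above, and data sparsity in lower-resource languages makes the problem worse.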


Identifying Causes and Effects

Reasoning about the causal links between events is a fundamental task in general artificial intelligence and in natural language understanding. As part of the SemEval 2012 exercise in semantic evaluation, Gordon et al. (2012) presented the Choice of Plausible Alternatives (COPA) shared task, in which systems must select either the more likely of two possible outcomes caused by a trigger event or the more likely of two possible triggers leading to an outcome event. An example item is the following:

Premise
The man lost his balance on the ladder. What happened as a result?
Alternative 1
He fell off the ladder.
Alternative 2
He climbed up the ladder.

Only one system participated in the SemEval competition (Goodwin et al., 2012), so there is plenty left to do with this dataset! The project would involve comparing a number of methods in a controlled fashion, potentially including event chains in the manner of Chambers and Jurafsky (2008), insights from the psychology of causal reasoning (Griffiths and Tenenbaum, 2005) and supervised models of causality trained on data annotated for another SemEval shared task by Hendrickx et al. (2010).
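
The task format described above can be sketched as a small data structure plus a decision rule that is independent of the scoring method, which is convenient when comparing several methods in a controlled fashion. This is only an illustrative sketch: the class name CopaItem, the field names, the function choose and the toy association table are all hypothetical, and the table values are invented for the example rather than derived from any corpus or from the COPA data.

```python
from dataclasses import dataclass


@dataclass
class CopaItem:
    premise: str
    asks_for: str  # "effect" or "cause"
    alternative1: str
    alternative2: str


def choose(item, score):
    """Return 1 or 2, given a function score(cause_text, effect_text) -> float."""
    if item.asks_for == "effect":
        # The premise is the trigger; each alternative is a candidate outcome.
        s1 = score(item.premise, item.alternative1)
        s2 = score(item.premise, item.alternative2)
    else:
        # The premise is the outcome; each alternative is a candidate trigger.
        s1 = score(item.alternative1, item.premise)
        s2 = score(item.alternative2, item.premise)
    return 1 if s1 >= s2 else 2


# Hypothetical causal-association strengths between (cause word, effect word).
TOY_ASSOCIATION = {
    ("lost", "fell"): 2.0,
    ("balance", "fell"): 3.0,
    ("balance", "climbed"): 0.5,
}


def tokens(text):
    return text.lower().rstrip(".").split()


def toy_score(cause, effect):
    """Sum the toy association strengths over all cause/effect word pairs."""
    return sum(
        TOY_ASSOCIATION.get((c, e), 0.0)
        for c in tokens(cause)
        for e in tokens(effect)
    )


item = CopaItem(
    premise="The man lost his balance on the ladder.",
    asks_for="effect",
    alternative1="He fell off the ladder.",
    alternative2="He climbed up the ladder.",
)
print(choose(item, toy_score))
```

Any of the methods mentioned above (event chains, psychologically motivated models, supervised causality classifiers) could in principle be plugged in as the score function, leaving the decision rule and evaluation loop unchanged.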