Computer Laboratory

Diarmuid Ó Séaghdha

MPhil Project Suggestions 2013-14

Using Natural Language Processing to investigate how spoken sentences are processed in the human brain (with Anna Korhonen and Barry Devereux)

The study of language has been a central activity across many disciplines, including psychology, linguistics, computer science, and cognitive neuroscience. However, to date there has been surprisingly little cross-disciplinary research integrating the powerful analytical tools of computational linguistics into the study of how language is processed in the human brain. In this project, you will contribute to the emerging field of computational neurolinguistics, by developing quantitative measures of specific aspects of sentence processing, and then evaluating these measures against state-of-the-art neuroimaging data.

Understanding a spoken sentence has several different processing components, involving different regions of the brain. The incoming speech must be acoustically and phonetically processed, and lexical information for each word must be activated and integrated through syntactic computations to produce a final representation of the utterance. Moreover, because the speech signal unfolds over time, temporary ambiguities arise where multiple candidate syntactic representations are consistent with the currently available input, and these ambiguities must be subsequently resolved for successful understanding. For example, the phrase "landing planes" is locally syntactically ambiguous; it could be a noun phrase in which "landing" modifies "planes" (e.g. the full sentence might be "landing planes are noisy") or it could be a gerundive clause, where "planes" is the object of "landing" (e.g. "landing planes is difficult"). Lexicalist accounts of sentence processing propose that lexico-syntactic knowledge associated with each word guides activation of candidate parses and is therefore influential in the ambiguity resolution process (Tyler and Marslen-Wilson, 1977; Marslen-Wilson et al., 1988; MacDonald et al., 1994). Such proposals are also supported by recent neuroimaging evidence (e.g. Shetreet et al., 2007; Tyler et al., 2013).

One important kind of lexico-syntactic knowledge is knowledge of selectional preference, the phenomenon by which verbs and other linguistic predicates are more likely to take certain semantic classes as arguments than others. For example, the direct object of "drink" is more likely to be a liquid than a human, but the subject of "drink" is more likely to be a human than a beverage. In this project, you will investigate how models of verb selectional preferences can be used to make predictions about the parsing preferences that people have when they hear locally ambiguous phrases. A variety of such models have been proposed in the NLP literature (e.g. Ó Séaghdha, 2010; Ó Séaghdha and Korhonen, 2012) and the project will involve evaluating a representative selection. Does knowledge of selectional preferences create expectations regarding how phrases such as "landing planes" get disambiguated, and can we determine the neural correlates of such expectations in the brain? The explanatory power of the model you develop will be evaluated against high-temporal-resolution neuroimaging data acquired as human subjects listened to spoken sentences (Tyler et al., 2013).
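
To make the idea concrete, here is a minimal sketch of how a selectional preference score might be turned into a prediction about a locally ambiguous phrase. It is not one of the cited models: it uses a plain add-alpha smoothed relative frequency, and the counts, the function names (preference, reading_scores) and the mapping from grammatical relation to reading are all assumptions made for illustration.

```python
from collections import Counter

# Toy (verb, relation, noun) counts, invented purely for illustration;
# a real model would estimate them from a parsed corpus.
TRIPLES = Counter({
    ("land", "subj", "plane"): 8,
    ("land", "subj", "pilot"): 2,
    ("land", "obj", "plane"): 3,
    ("land", "obj", "fish"): 1,
    ("drink", "obj", "water"): 9,
    ("drink", "subj", "man"): 7,
})
NOUNS = {noun for (_, _, noun) in TRIPLES}


def preference(verb, rel, noun, alpha=1.0):
    """Add-alpha smoothed estimate of P(noun | verb, rel)."""
    total = sum(c for (v, r, _), c in TRIPLES.items() if v == verb and r == rel)
    return (TRIPLES[(verb, rel, noun)] + alpha) / (total + alpha * len(NOUNS))


def reading_scores(verb, noun):
    """Normalised preference for the two readings of a phrase like 'landing planes'."""
    modifier = preference(verb, "subj", noun)  # noun-phrase reading: planes that land
    gerund = preference(verb, "obj", noun)     # gerundive reading: someone lands planes
    z = modifier + gerund
    return {"modifier": modifier / z, "gerund": gerund / z}


scores = reading_scores("land", "plane")
print(scores)
```

With these toy counts the noun-phrase reading receives the higher score, simply because "plane" was made more frequent as a subject of "land" than as its object. Whether scores of this kind track what listeners actually do is the empirical question the project addresses.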


Identifying Causes and Effects

Reasoning about the causal links between events is a fundamental task in general artificial intelligence and in natural language understanding. As part of the SemEval 2012 exercise in semantic evaluation, Gordon et al. (2012) presented the Choice of Plausible Alternatives (COPA) shared task in which systems must select either the more likely of two possible outcomes caused by a trigger event or the more likely of two possible triggers leading to an outcome event. An example item is the following:

Premise
The man lost his balance on the ladder. What happened as a result?
Alternative 1
He fell off the ladder.
Alternative 2
He climbed up the ladder.

Only one system participated in the SemEval competition (Goodwin et al., 2012) and the task has not received much attention since, so there is plenty left to do with this dataset! The project would involve comparing a number of methods in a controlled fashion, potentially including event chains in the manner of Chambers and Jurafsky (2008), insights from the psychology of causal reasoning (Griffiths and Tenenbaum, 2005) and supervised models of causality trained on data annotated for another SemEval shared task by Hendrickx et al. (2010).
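
As a rough starting point for such a comparison, the sketch below scores each alternative by the average pointwise mutual information (PMI) between content words of the cause sentence and content words of the effect sentence. This word-association baseline is not one of the methods listed above; it is swapped in here because it is short. The pair counts, the stopword list and the function names are all invented for illustration.

```python
import math
import re
from collections import Counter

# Toy (cause word, effect word) counts, invented for illustration only.
PAIR_COUNTS = Counter({
    ("lost", "fell"): 6,
    ("balance", "fell"): 5,
    ("lost", "climbed"): 1,
    ("ladder", "climbed"): 4,
    ("ladder", "fell"): 3,
})
CAUSES = {c for c, _ in PAIR_COUNTS}
EFFECTS = {e for _, e in PAIR_COUNTS}
STOPWORDS = {"the", "his", "on", "he", "off", "up", "man"}


def content_words(sentence):
    return [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOPWORDS]


def pmi(cause, effect, alpha=0.1):
    """Add-alpha smoothed PMI over the observed cause and effect vocabularies."""
    n = sum(PAIR_COUNTS.values()) + alpha * len(CAUSES) * len(EFFECTS)
    joint = PAIR_COUNTS[(cause, effect)] + alpha
    cause_mass = sum(PAIR_COUNTS[(cause, e)] for e in EFFECTS) + alpha * len(EFFECTS)
    effect_mass = sum(PAIR_COUNTS[(c, effect)] for c in CAUSES) + alpha * len(CAUSES)
    return math.log(joint * n / (cause_mass * effect_mass))


def causal_score(cause_sentence, effect_sentence):
    """Average PMI over word pairs; pairs with an unseen word are skipped."""
    pairs = [(c, e)
             for c in content_words(cause_sentence)
             for e in content_words(effect_sentence)
             if c in CAUSES and e in EFFECTS]
    if not pairs:
        return float("-inf")
    return sum(pmi(c, e) for c, e in pairs) / len(pairs)


def choose_alternative(premise, alternatives, asks_for="effect"):
    """Return the 1-based index of the higher-scoring alternative."""
    def score(alt):
        if asks_for == "effect":
            return causal_score(premise, alt)
        return causal_score(alt, premise)  # the alternative is the candidate cause
    return max(range(len(alternatives)), key=lambda i: score(alternatives[i])) + 1


choice = choose_alternative(
    "The man lost his balance on the ladder.",
    ["He fell off the ladder.", "He climbed up the ladder."],
)
print(choice)
```

The asks_for argument covers the two question types in COPA: when the item asks for the cause rather than the result, the premise is treated as the effect sentence. Keeping the scoring function separate from the selection step makes it straightforward to substitute a different scorer while holding the rest of the evaluation fixed.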


Tracking Word Meaning Through Time

Modelling word meaning is a fundamental task in NLP and as such has received a lot of attention from researchers. The distributional approach to lexical semantics gives us a powerful way to model the meanings of words by observing their patterns of co-occurrence in text corpora (Turney and Pantel, 2010); in particular, probabilistic latent-variable models have been shown to produce robust and informative representations (Ó Séaghdha, 2010). It would also be interesting to track how a word changes its meaning over time; for example, the rise of computing and the Internet has given us new senses for many words including "mouse", "surf" and "tweet". Previous work has used corpora to track changes in topic prominence, lexical frequency and syntactic preferences over time (Blei and Lafferty, 2006; Hall et al., 2008; Michel et al., 2010; Danescu-Niculescu-Mizil et al., 2013), and machine learning researchers have developed models that learn diachronic variations of topic structure (Blei and Lafferty, 2006; Wang and McCallum, 2006). Applying insights and methods from this body of research to problems in lexical semantics has great potential to help us detect and analyse how and where words change their meanings. There has been some previous work in this area (e.g. Sagi et al., 2009) but it remains underexplored.

The goal of this project is to investigate methods for modelling how the distributional aspects of a word's meaning change over time in a chronologically partitioned corpus. Depending on the student's interests, this corpus may come from digitised books, scientific articles or social media. Evaluation will combine quantitative and qualitative experiments, potentially including user studies with domain experts.
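
One very simple way to operationalise this, sketched below under stated assumptions, is to build a co-occurrence vector for the target word in each time slice and measure the cosine distance between slices. The two-slice corpus, the window size and the function names are invented for illustration; a real study would use far more data, and probably weighted or latent-variable representations rather than raw counts.

```python
import math
from collections import Counter

# Toy chronologically partitioned corpus, invented for illustration only.
CORPUS = {
    "1980s": ["the mouse ate the cheese", "a cat chased the mouse"],
    "2010s": ["click the mouse button", "the mouse cursor moved on the screen"],
}


def context_vector(word, sentences, window=2):
    """Count the words occurring within `window` tokens of each occurrence of `word`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vec.update(left + right)
    return vec


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def semantic_change(word, early, late):
    """Cosine distance between the word's context vectors in two time slices."""
    return 1.0 - cosine(context_vector(word, CORPUS[early]),
                        context_vector(word, CORPUS[late]))


change = semantic_change("mouse", "1980s", "2010s")
print(round(change, 3))
```

In this toy example most of the remaining similarity between the two slices comes from the shared function word "the", which suggests why stopword filtering or association weighting would normally be applied before comparing vectors.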

This project would be a good fit for students interested in lexical semantics, lexicography, computational sociolinguistics, machine learning and/or digital humanities.