Using Natural Language Processing to investigate how spoken sentences are processed in the human brain

  • Proposer: Anna Korhonen, Barry Devereux and Diarmuid Ó Séaghdha
  • Supervisor: Anna Korhonen, Barry Devereux and Diarmuid Ó Séaghdha
  • Special Resources: None

    Description

    The study of language has been a central activity across many disciplines, including psychology, linguistics, computer science, and cognitive neuroscience. However, to date there has been surprisingly little cross-disciplinary research integrating the powerful analytical tools of computational linguistics into the study of how language is processed in the human brain. In this project, you will contribute to the emerging field of computational neurolinguistics, by developing quantitative measures of specific aspects of sentence processing, and then evaluating these measures against state-of-the-art neuroimaging data.

    Understanding a spoken sentence involves several processing components, which engage different regions of the brain. The incoming speech must be acoustically and phonetically processed, and lexical information for each word must be activated and integrated through syntactic computations to produce a final representation of the utterance. Moreover, because the speech signal unfolds over time, temporary ambiguities arise where multiple candidate syntactic representations are consistent with the currently available input, and these ambiguities must subsequently be resolved for successful understanding. For example, the phrase "landing planes" is locally syntactically ambiguous; it could be a noun phrase in which "landing" modifies "planes" (e.g. the full sentence might be "landing planes are noisy") or it could be a gerundive clause, where "planes" is the object of "landing" (e.g. "landing planes is difficult"). Lexicalist accounts of sentence processing propose that lexico-syntactic knowledge associated with each word guides activation of candidate parses and is therefore influential in the ambiguity resolution process (Tyler and Marslen-Wilson, 1977; Marslen-Wilson et al., 1988; MacDonald et al., 1994). Such proposals are also supported by recent neuroimaging evidence (e.g. Shetreet et al., 2007; Tyler et al., 2013).

    One important kind of lexico-syntactic knowledge is knowledge of selectional preference, the phenomenon by which verbs and other linguistic predicates are more likely to take certain semantic classes as arguments than others. For example, the direct object of "drink" is more likely to be a liquid than a human, but the subject of "drink" is more likely to be a human than a beverage. In this project, you will investigate how models of verb selectional preferences can be used to make predictions about the parsing preferences that people have when they hear locally ambiguous phrases. A variety of such models have been proposed in the NLP literature (e.g. Ó Séaghdha, 2010; Ó Séaghdha and Korhonen, 2012) and the project will involve evaluating a representative selection. Does knowledge of selectional preferences create expectations regarding how phrases such as "landing planes" get disambiguated, and can we determine the neural correlates of such expectations in the brain? The explanatory power of the model you develop will be evaluated against high-temporal-resolution neuroimaging data acquired as human subjects listened to spoken sentences (Tyler et al., 2013).
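
    As a rough illustration of how selectional preferences could bias the reading of "landing planes", the sketch below estimates a smoothed conditional probability of a noun given a verb and a grammatical relation, then compares the object reading (the gerundive clause) with the subject reading (the modifier noun phrase). It is not one of the cited models: the counts, the relation labels (dobj, nsubj) and the function names are all invented for illustration, and a real model would be estimated from a parsed corpus.

```python
from collections import Counter

# Toy (verb, relation, noun) counts. These numbers are invented for
# illustration; a real model would estimate them from a parsed corpus.
COUNTS = Counter({
    ("land", "dobj", "plane"): 20,
    ("land", "dobj", "job"): 20,
    ("land", "dobj", "fish"): 10,
    ("land", "nsubj", "plane"): 70,
    ("land", "nsubj", "pilot"): 20,
    ("land", "nsubj", "bird"): 10,
})
VOCAB = sorted({noun for (_, _, noun) in COUNTS})


def preference(verb, relation, noun, alpha=1.0):
    """Add-alpha smoothed estimate of P(noun | verb, relation)."""
    total = sum(c for (v, r, _), c in COUNTS.items()
                if v == verb and r == relation)
    return (COUNTS[(verb, relation, noun)] + alpha) / (total + alpha * len(VOCAB))


def reading_scores(verb, noun):
    """Normalised preference for the noun as object vs. subject of the verb."""
    obj = preference(verb, "dobj", noun)    # "landing planes is difficult"
    subj = preference(verb, "nsubj", noun)  # "landing planes are noisy"
    z = obj + subj
    return {"object_reading": obj / z, "subject_reading": subj / z}


scores = reading_scores("land", "plane")
print(scores)
```

    With these toy counts the subject reading receives the higher score; with real corpus counts the direction and size of the preference is an empirical question, which is what the project sets out to test against the neuroimaging data.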

    Programming language: preferably Python, Java and/or MATLAB.

    References:

    Korhonen A, Krymolowski Y, Briscoe T (2006) A Large Subcategorization Lexicon for Natural Language Processing Applications. In: Proceedings of the 5th International Conference on Language Resources and Evaluation. Genova, Italy.

    MacDonald MC, Pearlmutter NJ, Seidenberg MS (1994) Lexical Nature of Syntactic Ambiguity Resolution. Psychol Rev 101:676-703.

    Marslen-Wilson W, Brown CM, Tyler LK (1988) Lexical representations in spoken language comprehension. Lang Cogn Process 3:1.

    McCarthy D (2001) Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Available at: http://www.dianamccarthy.co.uk/papers/finalthesis.pdf.

    Ó Séaghdha D (2010) Latent variable models of selectional preference. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10). Uppsala, Sweden.

    Ó Séaghdha D, Korhonen A (2012) Modelling selectional preferences in a lexical hierarchy. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp 170-179 SemEval'12. Stroudsburg, PA, USA: Association for Computational Linguistics.

    Shetreet E, Palti D, Friedmann N, Hadar U (2007) Cortical Representation of Verb Processing in Sentence Comprehension: Number of Complements, Subcategorization, and Thematic Frames. Cereb Cortex 17:1958-1969.

    Tyler LK, Cheung TPL, Devereux BJ, Clarke A (2013) Syntactic computations in the language network: characterizing dynamic network properties using representational similarity analysis. Front Lang Sci 4:271.

    Tyler LK, Marslen-Wilson WD (1977) The On-Line Effects of Semantic Context on Syntactic Processing. J Verbal Learn Verbal Behav 16

    Making sense of abstraction in language

  • Proposer: Anna Korhonen and Felix Hill
  • Supervisor: Anna Korhonen and Felix Hill
  • Special Resources: None

    Description

    Abstraction is everywhere in semantics. To illustrate, consider the following examples:

    (1) That man's hat is too small for his head.

    (2) The CEO is the head of the company.

    The second sense of head is related to the first - the CEO controls the company in the way that the head controls the body, and the position of CEO in an organisation chart is at the top, just like a human head. This analysis tells us that the first sense is the primary, original sense and that the second sense is an abstraction of the first, which emerged over time as head became a well-understood concept in the community.

    Few if any approaches to the related tasks of word sense disambiguation (WSD) and word sense induction (WSI) consider that word senses are intimately connected with this process of abstraction. However, it is not hard to imagine how an automated system might disambiguate the senses in (1) and (2) by detecting that the second instance is more abstract than the first. One obvious approach would be to consider the context: in the first sentence the other noun, "hat", is clearly concrete, whereas in the second, "company" is a more abstract noun.

    In this project, the student will exploit a newly released dataset containing concreteness ratings for 40,000 words [1] in order to investigate how concreteness relates to word senses. A simple starting point would be to re-implement a state-of-the-art WSD or WSI system and then aim to improve it by integrating sensitivity to the concreteness of the context.
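
    The context-based idea can be sketched as follows. This is only a minimal illustration, not a WSD system: the ratings dictionary holds invented placeholder values on a 1 (abstract) to 5 (concrete) scale rather than entries from the real dataset, and the threshold and function names are assumptions chosen for the example.

```python
import re

# Placeholder concreteness ratings, 1 (abstract) to 5 (concrete).
# The values are invented for illustration, not taken from the dataset.
RATINGS = {"man": 4.8, "hat": 4.9, "small": 3.0, "ceo": 3.5, "company": 2.8}


def context_concreteness(sentence, target):
    """Mean rating of the rated context words, excluding the target word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    rated = [RATINGS[t] for t in tokens if t != target and t in RATINGS]
    return sum(rated) / len(rated) if rated else None


def guess_sense(sentence, target, threshold=3.5):
    """Label the target 'concrete' or 'abstract' from its context.

    The threshold is an arbitrary assumption; returns None when no
    context word has a rating.
    """
    score = context_concreteness(sentence, target)
    if score is None:
        return None
    return "concrete" if score >= threshold else "abstract"


s1 = "That man's hat is too small for his head."
s2 = "The CEO is the head of the company."
print(guess_sense(s1, "head"), guess_sense(s2, "head"))
```

    In a fuller system the mean context concreteness would more plausibly be one feature among many in an existing WSD or WSI model than a hard decision rule.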

    Depending on the student's interest, such a system could also then be applied to interesting analyses of language on the internet and across social media platforms. For example, the student could investigate how abstraction in language correlates with education level, age or the political affiliation of the speaker.

    References:

    [1] Brysbaert, M., Warriner, A.B., Kuperman, V. (in press). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods.

    [2] Neuman, Y., Assaf, D., Cohen, Y. 2013. A cognitively motivated word sense induction algorithm. Computational Intelligence, Cognitive Algorithms, Mind, and Brain.

    [3] Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach, Sameer Pradhan. 2010. SemEval-2010 Task 14: Word Sense Induction and Disambiguation.

    [4] Turney, Peter D., et al. 2011. "Literal and metaphorical sense identification through concrete and abstract context." Proceedings of the 2011 Conference on the Empirical Methods in Natural Language Processing.

    Domain adaptation of biomedical text mining

  • Proposer: Anna Korhonen and Yufan Guo
  • Supervisor: Anna Korhonen and Yufan Guo
  • Special Resources: None

    Description

    Text mining has the potential to yield significant benefits in biomedical research. Among the most appealing are the ability to unlock hidden information in large collections of scientific literature, to develop new knowledge, and to improve the efficiency and quality of the research process. However, a number of research challenges need to be addressed before these benefits can be fully realised. One of the most important is to improve the portability of text mining. Because biomedicine shows significant sub-domain variation (Lippincott et al., 2011), techniques optimised for specific areas of biomedicine (e.g. molecular biology) do not necessarily port well to others (e.g. public health, neurology, physiology). Most current techniques rely on manually annotated, domain-specific datasets, which are costly to obtain. Researchers have therefore begun to investigate domain adaptation, which uses labeled data from one or several source domains to learn a hypothesis that performs well on a different domain for which little or no labeled data is available (Blitzer and Daumé, 2010). Although some prior work exists on domain adaptation of biomedical text processing and mining (e.g. Miwa et al., 2012), there are many unexplored areas. This project will investigate domain adaptation in one of these areas.

    This project would suit a student who is interested in machine learning and in applying NLP to biomedicine. Familiarity with machine learning toolkits such as Weka and Mallet would be useful (but is not necessary). Programming language of choice.
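
    As a rough illustration of the domain adaptation setting described above, the sketch below shows feature augmentation, a simple and well-known baseline for the case where a little labeled target-domain data is available. The technique is not named in this project description and is shown only as one possible starting point; the feature names and the helper function are hypothetical, and the domain labels are borrowed from the examples above.

```python
def augment(features, domain):
    """Copy each feature into a shared version and a domain-specific version.

    A classifier trained on the augmented features can learn which cues
    transfer across domains (shared copies) and which are domain-specific.
    """
    out = {}
    for name, value in features.items():
        out[f"shared::{name}"] = value
        out[f"{domain}::{name}"] = value
    return out


# Hypothetical bag-of-features representations of two training examples.
source = augment({"word=binds": 1.0, "suffix=ase": 1.0}, "molecular_biology")
target = augment({"word=binds": 1.0, "word=cohort": 1.0}, "public_health")

# Only the shared copy of a feature seen in both domains overlaps.
overlap = sorted(set(source) & set(target))
print(overlap)
```

    The augmented feature dictionaries can be passed to any standard classifier; the choice of learner is left open here.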

    References:

    Tom Lippincott, Diarmuid Ó Séaghdha and Anna Korhonen. 2011. Exploring subdomain variation in biomedical language. BMC Bioinformatics 12:212.

    John Blitzer and Hal Daumé III. 2010. ICML 2010 Tutorial on Domain Adaptation: http://adaptationtutorial.blitzer.com/

    Miwa, M., Thompson, P. and Ananiadou, S. (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 2012.

    Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg, and Ulla Stenius. 2011. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics 12:69.

    Anna Korhonen, Diarmuid Ó Séaghdha, Ilona Silins, Lin Sun, Johan Hogberg and Ulla Stenius. 2012. Text mining for literature review and knowledge discovery in cancer risk assessment and research. PLoS ONE 7(4):e33427.