Please contact us before applying for the project.
Non-compositional multiword expressions (MWEs) are combinations of two or more words (e.g. take place, spill the beans) for which the meaning of the whole cannot be fully determined from the meanings of their components (Baldwin & Kim 2010). These expressions are challenging not only for NLP tasks but also for language learners. The aim of this project is to identify MWEs in text and potentially link them to their dictionary definitions in order to facilitate reading comprehension for language learners.
Previous studies have focused independently on tagging text for MWEs (Schneider et al. 2014, Rohanian et al. 2019) or extracting them for building lexica (Cordeiro et al. 2019). What is still needed is a hybrid system that can identify MWEs, link them to the corresponding entries in lexica and augment existing resources with possible new expressions. For instance, in a sentence containing puts on, the components of the expression should be identified and the expression should be linked to its canonical form (put on) in a lexical database (e.g. WordNet).
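The identify-and-link step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `LEXICON` and `LEMMAS` tables stand in for a real lexical database and lemmatiser, the `max_gap` parameter is a hypothetical way of allowing discontinuous expressions, and the example sentence is invented rather than taken from any dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: canonical (lemmatised) form -> definition.
# A real system would query a resource such as WordNet instead.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("put", "on"): "to dress oneself in a piece of clothing",
    ("take", "place"): "to happen",
}

# Hypothetical lemma table standing in for a real lemmatiser.
LEMMAS: Dict[str, str] = {"puts": "put", "took": "take", "takes": "take"}


def lemma(token: str) -> str:
    """Lower-case the token and map it to its lemma if one is known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def find_mwes(tokens: List[str], max_gap: int = 3) -> List[Tuple[List[int], str]]:
    """Return (token positions, canonical form) for each lexicon match.

    Components must appear in order, with at most `max_gap` other tokens
    between consecutive components, so discontinuous uses are found too.
    """
    lemmas = [lemma(t) for t in tokens]
    matches = []
    for entry in LEXICON:
        for start, lem in enumerate(lemmas):
            if lem != entry[0]:
                continue
            positions = [start]
            for component in entry[1:]:
                lo = positions[-1] + 1
                window = range(lo, min(lo + max_gap + 1, len(lemmas)))
                nxt = next((i for i in window if lemmas[i] == component), None)
                if nxt is None:
                    break  # this start position does not complete the entry
                positions.append(nxt)
            else:
                matches.append((positions, " ".join(entry)))
    return matches


sentence = "She puts her coat on".split()
for positions, canonical in find_mwes(sentence):
    print(positions, canonical, "->", LEXICON[tuple(canonical.split())])
```

Pure lexicon matching of this kind cannot tell idiomatic from literal uses and cannot discover expressions missing from the lexicon, which is why the project pairs lexical resources with a learned tagger.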
This hybrid approach can benefit from a combination of available lexica, annotated datasets and large amounts of unannotated text. The aim of this project is to design a semi-supervised system that can tag MWEs in text and simultaneously link them to a lexicon. The systems and papers published as part of the shared task on semi-supervised identification of verbal multiword expressions (edition 1.2) are very relevant to this project.
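To make the tagging side concrete, the sketch below decodes per-token MWE labels into groups of token positions. It assumes labels written in the style of the MWE column of the PARSEME shared task data files, where `*` marks a token outside any expression, the first token of an expression carries `id:category`, later tokens carry only the `id`, and several codes on one token are separated by `;`. The function name and the example labels are hypothetical, and the shared task documentation remains the authoritative description of the format.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def decode_mwe_column(tags: List[str]) -> Dict[int, Tuple[Optional[str], List[int]]]:
    """Group token positions by MWE id and attach the category, if given.

    Returns {mwe_id: (category or None, [token positions])}.
    """
    categories: Dict[int, str] = {}
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, tag in enumerate(tags):
        if tag in ("*", "_"):  # no annotation (or, here, an unspecified one)
            continue
        for code in tag.split(";"):  # a token may belong to several MWEs
            mwe_id, _, category = code.partition(":")
            members[int(mwe_id)].append(idx)
            if category:  # only the first token of an MWE names the category
                categories[int(mwe_id)] = category
    return {m: (categories.get(m), toks) for m, toks in members.items()}


# Illustrative labels for the tokens: She puts her coat on
tags = ["*", "1:VPC.full", "*", "*", "1"]
print(decode_mwe_column(tags))
```

Because membership is expressed through shared ids rather than adjacency, this encoding handles discontinuous expressions such as puts ... on without any special treatment.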
Python code for one of the top systems (Taslimipoor et al., 2020) in the shared task is available.
PARSEME Shared Task 2020 on Semi-Supervised verbal MWE Identification
Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, second edition, pages 267–292. CRC Press.
Silvio Cordeiro, Aline Villavicencio, Marco Idiart, and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.
Carlos Ramisch et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 222–240. Association for Computational Linguistics.
Omid Rohanian, Shiva Taslimipoor, Samaneh Kouchaki, Le An Ha, and Ruslan Mitkov. 2019. Bridging the gap: Attending to discontinuity in identification of multiword expressions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2692–2698.
Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. TACL, 2:193–206.
Shiva Taslimipoor, Sara Bahaadini, and Ekaterina Kochmar. 2020. MTLB-STRUCT @PARSEME 2020: Capturing unseen multiword expressions using multi-task learning and pre-trained masked language models. arXiv preprint arXiv:2011.02541.