Non-compositional multiword expressions (MWEs) are combinations of two or more words (e.g. take place, spill the beans) whose overall meaning cannot be fully determined from the meanings of their components (Baldwin & Kim 2010). These expressions are challenging not only for NLP tasks but also for language learners. This project aims to identify MWEs in text and link them to their dictionary definitions in order to facilitate reading comprehension for language learners.
Previous studies have focused separately on tagging text for MWEs (Schneider et al. 2014, Rohanian et al. 2019) or on extracting them to build lexica (Cordeiro et al. 2019). What is still needed is a hybrid system that can identify MWEs, link them to the corresponding entries in lexica, and augment existing resources with possible new expressions. For instance, in a sentence containing the expression puts on, the components of the expression should be identified and the expression should be linked to its canonical form (put on) in a lexical database (e.g. WordNet), so that a definition can be retrieved.
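The identify-and-link step described above can be illustrated with a minimal sketch. It is not the proposed system: the lexicon, its definitions, the inflection table and the function names (`TOY_LEXICON`, `TOY_LEMMAS`, `lemmatise`, `link_mwes`) are all hypothetical placeholders, and the example sentence is invented for illustration. A real system would replace the lookup tables with a proper lemmatiser and a resource such as WordNet, and the naive matching with a trained tagger.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: canonical MWE form -> placeholder definition.
# These are illustrative glosses, not entries copied from WordNet.
TOY_LEXICON: Dict[str, str] = {
    "put on": "to dress oneself in (a piece of clothing)",
    "take place": "to happen or occur",
}

# Hypothetical inflection table standing in for a real lemmatiser.
TOY_LEMMAS: Dict[str, str] = {
    "puts": "put", "putting": "put",
    "takes": "take", "took": "take", "taken": "take", "taking": "take",
}


def lemmatise(token: str) -> str:
    """Lower-case a token and map known inflected forms to their lemma."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def link_mwes(tokens: List[str], max_gap: int = 3) -> List[Tuple[List[int], str, str]]:
    """Return (token indices, canonical form, definition) for each lexicon match.

    Components must appear in order; up to `max_gap` other tokens may
    intervene between consecutive components (e.g. "puts her coat on").
    """
    lemmas = [lemmatise(t) for t in tokens]
    matches = []
    for canonical, definition in TOY_LEXICON.items():
        parts = canonical.split()
        for start, lemma in enumerate(lemmas):
            if lemma != parts[0]:
                continue
            indices = [start]
            for part in parts[1:]:
                # Search for the next component within the allowed gap.
                stop = min(len(lemmas), indices[-1] + 2 + max_gap)
                found = next(
                    (i for i in range(indices[-1] + 1, stop) if lemmas[i] == part),
                    None,
                )
                if found is None:
                    break
                indices.append(found)
            else:
                # Every component was found, so record the link.
                matches.append((indices, canonical, definition))
    return matches


if __name__ == "__main__":
    for indices, canonical, definition in link_mwes("She puts her coat on".split()):
        print(indices, canonical, "->", definition)
```

Because the matching is purely lexical, this sketch would also link literal uses of the same words (a false positive), which is one reason a learned tagger is needed for the identification step.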
More specifically, the project aims to design a semi-supervised system that tags MWEs in text and links them to a lexicon at the same time. A closely related academic shared task (listed below) is run between April 2020 and June 2020. The implemented system could participate in this shared task, which may lead to the publication of a paper.
Python code for one of the top systems (Taslimipoor & Rohanian 2018) in the PARSEME shared task on automatic identification of verbal MWEs (Ramisch et al. 2018) will be provided.
PARSEME Shared Task 2020 on Semi-Supervised verbal MWE Identification
Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, second edition, pages 267–292. CRC Press.
Silvio Cordeiro, Aline Villavicencio, Marco Idiart, and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.
Carlos Ramisch et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 222–240. Association for Computational Linguistics.
Omid Rohanian, Shiva Taslimipoor, Samaneh Kouchaki, Le An Ha, and Ruslan Mitkov. 2019. Bridging the gap: Attending to discontinuity in identification of multiword expressions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2692–2698.
Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. TACL, 2:193–206.
Shiva Taslimipoor and Omid Rohanian. 2018. SHOMA at PARSEME shared task on automatic identification of VMWEs: Neural multiword expression tagging with high generalisation. arXiv preprint arXiv:1809.03056.