Distributional methods and deep parsing

Proposer: Ann Copestake
Supervisor: Ann Copestake


Distributional semantic methods depend on the idea that words with similar meanings will appear in similar contexts. There is a considerable amount of work on distributional techniques in computational linguistics, but much of this concentrates on relatively simple, isolated tasks, such as measurements of similarity. If distributional techniques are providing information about meaning, it should be possible to use them in combination with other computational linguistics techniques, such as parsing and generation. It should also be possible to exploit very large scale parsed corpora for the extraction of distributions.
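A minimal illustration of the underlying idea, using toy context-count vectors and cosine similarity. The corpus, window size and similarity measure here are illustrative choices only, not part of the proposal:

```python
from collections import Counter
from math import sqrt

# Toy corpus: a word's distribution is the bag of words co-occurring with it
# within a fixed window. Both the corpus and the window size are invented
# for illustration.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the cat ate the fish",
    "the dog ate the bone",
]

def context_vectors(sentences, window=2):
    """Count the context words within +/- window of each target word."""
    vectors = {}
    for sent in sentences:
        tokens = sent.split()
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

vecs = context_vectors(corpus)
# "cat" and "dog" occur in similar contexts, so their similarity should
# exceed that of "cat" and "bone".
print(cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["bone"]))
```

In practice the contexts would be counted over very large corpora and the raw counts typically reweighted (e.g., by pointwise mutual information), but the same similarity computation applies.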

The two proposals listed below could be the basis for separate ACS projects, but it would also be possible to combine topics. See also Dhruv Kumar's (2011-12) project on ordering constraints in language generation.

  1. Extraction of distributions from a semantically annotated corpus.

    The WikiWoods corpus has been constructed by automatically parsing a dump of the English Wikipedia with a very detailed grammar of English. This is an interesting resource to investigate for the extraction of distributions. There has been considerable work on the use of syntactically parsed data for this purpose (see, e.g., Padó and Lapata, 2007), but the use of a semantic resource could improve results (e.g., because it allows generalisations over equivalent constructions with different syntax). There are various possible ways in which the parsed data could be utilised, and part of the project would involve developing a flexible approach for extraction of the data so that a range of options could be investigated.

    The distributions could be compared with distributions extracted from unparsed corpora on standard tasks such as similarity, or they could be used in one of the proposals described below. Ideally this project would lead to a resource which could be distributed as part of DELPH-IN.
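    As a sketch of what such distributions might look like, the following uses invented (predicate, role, argument) triples as a stand-in for the semantic dependencies that could be read off the corpus analyses; the triples and role labels are hypothetical and far simpler than real WikiWoods output:

```python
from collections import Counter, defaultdict

# Invented semantic dependency triples standing in for those that could be
# extracted from a semantically annotated corpus.
triples = [
    ("chase", "ARG1", "dog"), ("chase", "ARG2", "cat"),
    ("chase", "ARG1", "cat"), ("chase", "ARG2", "mouse"),
    ("eat", "ARG1", "cat"), ("eat", "ARG2", "fish"),
    ("eat", "ARG1", "dog"), ("eat", "ARG2", "bone"),
]

def argument_distributions(triples):
    """Distribution for each noun: the (predicate, role) slots it fills.

    Because the contexts are semantic slots rather than surface strings,
    equivalent constructions with different syntax (e.g., actives and
    passives) contribute to the same counts.
    """
    dists = defaultdict(Counter)
    for pred, role, arg in triples:
        dists[arg][(pred, role)] += 1
    return dists

dists = argument_distributions(triples)
print(dists["cat"])  # the slots "cat" has been seen to fill
```

The resulting sparse count vectors could then be fed into the same similarity computations used with distributions from unparsed corpora, allowing a direct comparison on standard tasks.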

  2. Parse ranking.

    Standard techniques for parse ranking have limited capability to determine the extent to which constructions are semantically plausible. This project will look at the use of distributional semantics to rerank the output from a broad-coverage grammar of English. The intuition behind the use of distributional semantics in parse ranking is that it offers an additional source of information to that which is generally used (i.e., treebanks). While some experiments have been carried out on the use of distributional techniques for syntactic disambiguation, these have been done on isolated test sets (e.g., of noun-noun compounds) and it is not clear whether they extend to more complex constructions and whether they would offer a real advantage compared with existing parsing models.

    Preliminary experiments were carried out in the 2009 JHU CLSP Summer Workshop (see Chapter 6 of the JHU Report). The large quantity of treebanked data available from DELPH-IN would make it preferable to experiment with the DELPH-IN English Resource Grammar (ERG) for this project. The easiest approach to implement would be to rerank parses, but it may also be possible to experiment with adding distributionally-based features to the existing maximum entropy model.
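    A reranking step along these lines could be sketched as follows. The plausibility counts and candidate analyses are invented for illustration, and the scoring (add-alpha smoothed log relative frequency of a parse's semantic dependencies) is just one of many possible choices:

```python
from collections import Counter
from math import log

# Hypothetical counts of (predicate, role, argument) triples, standing in
# for distributional statistics derived from a large parsed corpus.
plausibility = Counter({
    ("eat", "ARG1", "man"): 50,
    ("eat", "ARG2", "fish"): 40,
    ("eat", "ARG1", "fish"): 2,
    ("eat", "ARG2", "man"): 1,
})

def score_parse(triples, counts, alpha=1.0):
    """Score a candidate parse by the smoothed log relative frequency of
    its semantic dependencies; higher means more plausible."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # crude vocabulary estimate for smoothing
    return sum(log((counts[t] + alpha) / (total + alpha * vocab))
               for t in triples)

# Two candidate analyses of an ambiguous sentence: who eats whom?
parse_a = [("eat", "ARG1", "man"), ("eat", "ARG2", "fish")]   # man eats fish
parse_b = [("eat", "ARG1", "fish"), ("eat", "ARG2", "man")]   # fish eats man
best = max([parse_a, parse_b], key=lambda p: score_parse(p, plausibility))
print(best is parse_a)
```

In the reranking setting such a score would be combined with, or added as features to, the ranking already produced by the grammar's statistical model, rather than replacing it.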