Distributional semantics and authorship differences

Proposer: Ann Copestake
Supervisor: Ann Copestake

Description

Distributional semantic methods depend on the idea that words with similar meanings appear in similar contexts. However, nearly all work in this area has used very large corpora that combine data from many genres and many authors. This means that data from different senses are merged and any differences between authors are obscured. The idea of this project is to look at distributions extracted from the works of individual authors. One possible source of suitable corpora is Project Gutenberg: while the complete works of any one author will be small compared with the corpora usually used in distributional semantics, relatively common words should occur with sufficient frequency in the works of prolific authors such as Conan Doyle or Dickens to make the experiment worthwhile.
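
As a toy illustration of the comparison involved (not part of the proposal itself), the Python sketch below builds a count-based context distribution for a target word from each of two text samples standing in for per-author corpora, and compares them with cosine similarity; the window size and the inline texts are illustrative assumptions.

    from collections import Counter
    import math
    import re

    def context_vector(text, target, window=5):
        # Count words co-occurring with `target' within +/- `window' tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
        return counts

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors.
        dot = sum(u[w] * v[w] for w in set(u) & set(v))
        norm = math.sqrt(sum(c * c for c in u.values())) \
             * math.sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    # Tiny inline stand-ins; in practice each string would be the concatenated
    # Project Gutenberg texts of one author.
    author_a = "he drew his watch from his pocket and glanced at the watch"
    author_b = "we watch the river and watch the evening light fade slowly"
    print(cosine(context_vector(author_a, "watch"),
                 context_vector(author_b, "watch")))

A low similarity here reflects both sense differences (the noun versus the verb `watch') and authorial style, which is exactly the confound the project would tease apart.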

Better sentence splitting to improve parsing and translation performance

Proposer: Ann Copestake
Supervisor: Ann Copestake

Description

Many algorithms in parsing, generation and translation are intractable when applied to longer sentences. It is common in Statistical Machine Translation (SMT) to preprocess corpora by using heuristics to split sentences of more than around 20 words. The aim of this project is to see whether a principled approach to sentence splitting could be developed using the topological properties of the dependency graph (specifically the Dependency Minimal Recursion Semantics, or DMRS, graph): that is, whether a machine learning approach could be developed to determine where good splits can be made. The first stage of the project would be to attempt this for English, using the very large DMRS-annotated corpora already available; if successful, the project could be extended to other languages. Besides potentially improving SMT, such a system could be used to improve parser robustness and to aid treebanking.
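
As a rough sketch of the kind of topological feature a learned splitter might use, the Python fragment below (using the networkx library) scores each edge of a toy dependency graph by how balanced the two halves it separates are. The example graph is hand-made and undirected for simplicity, whereas the real input would be DMRS graphs from the annotated corpora, and a machine learning approach would combine such features with many others.

    import networkx as nx

    def split_candidates(g):
        # Score each edge by the balance of the two components its removal leaves.
        scores = {}
        for u, v in list(g.edges()):
            h = g.copy()
            h.remove_edge(u, v)
            parts = list(nx.connected_components(h))
            if len(parts) == 2:                # the edge is a bridge: a possible split point
                a, b = (len(p) for p in parts)
                scores[(u, v)] = min(a, b) / max(a, b)   # 1.0 means perfectly balanced halves
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # Hand-made dependency tree for `The dog slept while the cat ran'
    # (tokens are position-indexed so repeated words stay distinct).
    g = nx.Graph([("dog_2", "the_1"), ("slept_3", "dog_2"), ("slept_3", "ran_7"),
                  ("ran_7", "while_4"), ("ran_7", "cat_6"), ("cat_6", "the_5")])
    for edge, balance in split_candidates(g):
        print(edge, round(balance, 2))         # the clause-boundary edge scores highest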

Using distributional semantics to expand the lexicon of a limited domain system

Proposer: Ann Copestake
Supervisor: Ann Copestake

Description

Ever since the first practical limited-domain systems were built in the late 1970s, out-of-vocabulary utterances have been a problem. In the case of systems relying on manually constructed lexicons, the developer cannot predict all the possible ways in which a query can be phrased, while more modern machine learning approaches sometimes fail because the training data has inadequate coverage.

A possible solution is to use distributional techniques on non-domain-specific data to identify the relationship of new lexical items to some existing concepts. For instance, if, in the ATIS domain, the verb `book' has been mapped to an in-domain concept BOOKFLIGHT, then it is plausible that `reserve' could be mapped to the same concept if its distributional similarity to `book' in a general corpus were above some appropriate threshold. `arrange', `get', `obtain', `purchase', `buy' and so on could also plausibly map to the same concept. This is, therefore, a somewhat different problem from finding synonyms or near-synonyms.
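
The following is a minimal sketch of that thresholded mapping, with made-up three-dimensional vectors standing in for distributions learned from a general corpus; the vectors and the threshold value are purely illustrative.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    vectors = {                       # toy stand-ins for general-corpus distributions
        "book":    np.array([0.9, 0.1, 0.2]),
        "reserve": np.array([0.8, 0.2, 0.1]),
        "read":    np.array([0.1, 0.9, 0.3]),
    }
    concepts = {"book": "BOOKFLIGHT"}  # seed entry from the hand-built mapping
    THRESHOLD = 0.8                    # illustrative value, to be tuned empirically

    for word, vec in vectors.items():
        for seed, concept in concepts.items():
            if word != seed and cosine(vec, vectors[seed]) > THRESHOLD:
                print(f"{word} -> {concept}")   # here: `reserve' passes, `read' does not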

The proposed approach is to start with the ATIS corpus and a hand-built mapping from the most frequent vocabulary items to domain concepts. The project will involve extending that mapping, using distributional models constructed from a large corpus such as Wikipedia. The first stage would be to acquire vocabulary which has a one-to-one correspondence with entries in the hand-built lexicon, with later investigations targeting phrasal equivalents (e.g., `make a reservation').
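
For the candidate-generation step, one convenient setup (an illustration, not a prescription; the proposal suggests models built from Wikipedia) is gensim's downloader with pretrained GloVe vectors, listing nearest neighbours of each seed word for vetting against the domain concepts. The seed entries below are hypothetical.

    import gensim.downloader as api

    # Downloads the pretrained vectors on first use (roughly 130 MB).
    wv = api.load("glove-wiki-gigaword-100")

    # Hypothetical seed entries from the hand-built ATIS lexicon.
    seed_lexicon = {"book": "BOOKFLIGHT", "fare": "FAREAMOUNT"}

    for seed, concept in seed_lexicon.items():
        print(concept)
        for candidate, sim in wv.most_similar(seed, topn=5):
            print(f"  {candidate}\t{sim:.2f}")   # candidates for vetting and thresholding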

Neural network simulations of the recognition of semantically ambiguous words

Proposer: Matt Davis
Supervisor: Matt Davis and Ann Copestake (or other NLIP UTO)

Description

This is a psycholinguistic project involving neural network simulation of the recognition of semantically ambiguous words (e.g., `bank') and the effect of prior and recent experience on meaning selection for these words (e.g., `river' vs `money'). It extends work carried out by Jenni Rodd and Matt Davis. Good Matlab skills would be necessary since the initial stage of the project would involve using existing Matlab code. Later stages would involve experimenting with different network architectures or learning algorithms.
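
By way of illustration only, the Python sketch below (the project itself builds on existing Matlab code) trains a small softmax network to select a meaning for the ambiguous form `bank' from a context cue, with exposure biased towards the `money' sense; probing with no context then shows the effect of prior experience on meaning selection. The architecture and numbers are assumptions made for the sketch, not the Rodd and Davis model.

    import numpy as np

    rng = np.random.default_rng(0)

    # Input features: [form `bank', river-context cue, money-context cue];
    # outputs: [RIVER sense, MONEY sense].
    def make_data(n_river, n_money):
        X = [[1, 1, 0]] * n_river + [[1, 0, 1]] * n_money
        y = [0] * n_river + [1] * n_money
        return np.array(X, float), np.array(y)

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def train(X, y, epochs=500, lr=0.5):
        W = rng.normal(0.0, 0.1, (X.shape[1], 2))
        targets = np.eye(2)[y]
        for _ in range(epochs):
            p = softmax(X @ W)
            W -= lr * X.T @ (p - targets) / len(y)   # cross-entropy gradient step
        return W

    # Biased prior experience: the MONEY sense is encountered four times as often.
    W = train(*make_data(n_river=10, n_money=40))

    # Probe: `bank' presented with no disambiguating context.
    p = softmax(np.array([1.0, 0.0, 0.0]) @ W)
    print(f"P(RIVER)={p[0]:.2f}  P(MONEY)={p[1]:.2f}")   # selection biased towards MONEY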