-
NLP in biomedical subdomains
My research at Cambridge studies the problems posed by the domain-specific linguistic properties of biomedical texts. The volume of material added to medical journals each year is far too large for end users (doctors, nurses, etc.) to use directly. Moreover, scientific language can differ substantially from everyday language, making general-purpose resources unsuitable. Determining how strongly this mismatch affects each specialty will show where resources are most lacking.
-
Verb subcategorization
Lexical information about verbs is particularly useful for a field like biomedicine, where new terms are frequently coined and information extraction is a common task.
-
Linguistic effects of transmission and translation
Our most culturally important literature is often of great antiquity: originally composed in a now-dead language by people far removed from us, it has passed through any number of modifying forces to reach its modern form. Cases where the history of these forces is well documented can be used to build models of them, and those models can then be applied in an unbiased fashion to less certain cases to uncover new hypotheses. The most straightforward situation is authorship identification (cf. the canonical example of the Federalist Papers).
-
Graphical models
In all these areas, I'm interested in building generative models that explain the data. Such models have the potential to tie together many linguistic dimensions in a principled fashion.