Computer Laboratory

Technical reports

A text representation language for contextual and distributional processing

Eric K. Henderson

April 2010, 207 pages

This technical report is based on a dissertation submitted in 2009 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Fitzwilliam College.

Abstract

This thesis examines distributional and contextual aspects of linguistic processing in relation to traditional symbolic approaches. Distributional processing is more commonly associated with statistical methods, while an integrated representation of context spanning document and syntactic structure is lacking in current linguistic representations. This thesis addresses both issues through a novel symbolic text representation language.

The text representation language encodes information from all levels of linguistic analysis in a semantically motivated form. Using object-oriented constructs in a recursive structure that can be derived from the syntactic parse, the language provides a common interface for symbolic and distributional processing. A key feature of the language is a recursive treatment of context at all levels of representation. The thesis gives a detailed account of the form and syntax of the language, as well as a treatment of several important constructions. Comparisons are made with other linguistic and semantic representations, and several of the distinguishing features are demonstrated through experiments.

The treatment of context in the representation language is discussed at length. The recursive structure employed in the representation is explained and motivated by issues involving document structure. Applications of the contextual representation in symbolic processing are demonstrated through several experiments.

Distributional processing is introduced using traditional statistical techniques to measure semantic similarity. Several extant similarity metrics are evaluated using a novel evaluation metric involving adjective antonyms. The results provide several insights into the nature of distributional processing, and this motivates a new approach based on characteristic adjectives.
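As a rough illustration of what measuring semantic similarity distributionally can look like, the sketch below compares words by the cosine of their co-occurrence count vectors. Cosine similarity is a standard measure chosen here for illustration; the abstract does not say which metrics the thesis evaluates. The function name and the toy counts are invented for this example and are not taken from the report.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (hypothetical helper)."""
    # Only context words shared by both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy co-occurrence counts (invented): how often each target word
# appears near each context word in some imagined corpus.
hot = Counter({"water": 4, "day": 3, "coffee": 2})
cold = Counter({"water": 5, "day": 2, "beer": 1})
table = Counter({"wooden": 3, "kitchen": 2})

sim_hot_cold = cosine_similarity(hot, cold)
sim_hot_table = cosine_similarity(hot, table)
```

In this toy data the antonym pair "hot" and "cold" shares most of its contexts and therefore scores higher than the unrelated pair "hot" and "table". Antonyms often appear in similar contexts, which is one reason they make a demanding test case for distributional similarity measures.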

Characteristic adjectives are distributionally derived and semantically differentiated vectors associated with a node in a semantic taxonomy. They are significantly lower-dimensioned than their undifferentiated source vectors, while retaining a strong correlation to their position in the semantic space. Their properties and derivation are described in detail and an experimental evaluation of their semantic content is presented.

Finally, the distributional techniques used to derive characteristic adjectives are extended to encompass symbolic processing. Rules involving several types of symbolic patterns are distributionally derived from a source corpus and applied to the text representation language. Polysemy is addressed in the derivation by limiting distributional information to monosemous words. The derived rules show a significant improvement in disambiguating nouns in a test corpus.

Full text

PDF (2.6 MB)

BibTeX record

@TechReport{UCAM-CL-TR-779,
  author =      {Henderson, Eric K.},
  title =       {{A text representation language for contextual and distributional processing}},
  year =        2010,
  month =       apr,
  url =         {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-779.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  number =      {UCAM-CL-TR-779}
}