Computer Laboratory

Technical reports

Automatic extraction of property norm-like data from large text corpora

Colin Kelly

September 2013, 154 pages

This technical report is based on a dissertation submitted September 2012 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity Hall.

Abstract

Traditional methods for deriving property-based representations of concepts from text have focused on extracting unspecified relationships (e.g., "car — petrol") or only a sub-set of possible relation types, such as hyponymy/hypernymy (e.g., "car is-a vehicle") or meronymy/metonymy (e.g., "car has wheels").

We propose a number of varied approaches towards the extremely challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms (in the form "concept relation feature", e.g., "elephant has trunk", "scissors used for cutting", "banana is yellow") from large text corpora. We present four distinct extraction systems for our task. In our first two experiments we manually develop syntactic and lexical rules designed to extract property norm-like information from corpus text. We explore the impact of corpus choice, investigate the efficacy of reweighting our output through WordNet-derived semantic clusters, introduce a novel entropy calculation specific to our task, and test the usefulness of other classical word-association metrics.
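As a rough illustration of what a classical word-association metric looks like when applied to concept/feature pairs, the sketch below computes pointwise mutual information (PMI) over a toy list of co-occurrences. PMI is assumed here as one representative metric; the abstract does not say which metrics were tested, and the function name `pmi_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Return PMI (in bits) for each distinct (concept, feature) pair.

    `pairs` is a list of observed (concept, feature) co-occurrences.
    This is a generic PMI sketch, not the report's own scoring scheme.
    """
    pair_counts = Counter(pairs)
    concept_counts = Counter(concept for concept, _ in pairs)
    feature_counts = Counter(feature for _, feature in pairs)
    total = len(pairs)

    scores = {}
    for (concept, feature), n in pair_counts.items():
        p_joint = n / total
        p_concept = concept_counts[concept] / total
        p_feature = feature_counts[feature] / total
        scores[(concept, feature)] = math.log2(p_joint / (p_concept * p_feature))
    return scores


# Toy co-occurrence observations, invented for illustration only.
observations = [
    ("elephant", "trunk"), ("elephant", "trunk"), ("elephant", "grey"),
    ("banana", "yellow"), ("banana", "yellow"), ("car", "grey"),
]
scores = pmi_scores(observations)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs that co-occur more often than their individual frequencies would predict receive higher scores, which is the sense in which such metrics can be used to rank or reweight candidate features.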

In our third experiment we employ semi-supervised learning to generalise from our findings thus far, viewing our task as one of relation classification in which we train a support vector machine on a known set of property norms. Our feature extraction performance is encouraging; however, the generated relations are restricted to those found in our training set. Therefore, in our fourth and final experiment we use an improved version of our semi-supervised system to initially extract only features for concepts. We then use the concepts and extracted features to anchor an unconstrained relation extraction stage, introducing a novel backing-off technique which assigns relations to concept/feature pairs using probabilistic information.
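The general idea of backing off can be sketched as follows. This is a generic illustration only and not the technique developed in the report, whose details the abstract does not give. It assumes relation counts are available at three levels of specificity (concept/feature pair, feature alone, all relations) and falls back to a less specific level whenever the more specific one has no evidence. The names `build_counts` and `assign_relation` and the toy triples are hypothetical.

```python
from collections import Counter, defaultdict


def build_counts(triples):
    """Count relations per (concept, feature) pair, per feature, and overall."""
    by_pair = defaultdict(Counter)
    by_feature = defaultdict(Counter)
    overall = Counter()
    for concept, relation, feature in triples:
        by_pair[(concept, feature)][relation] += 1
        by_feature[feature][relation] += 1
        overall[relation] += 1
    return by_pair, by_feature, overall


def assign_relation(concept, feature, by_pair, by_feature, overall):
    """Pick the most probable relation, backing off when evidence is missing.

    Returns (relation, relative frequency at the level that was used).
    """
    levels = (by_pair.get((concept, feature)), by_feature.get(feature), overall)
    for counts in levels:
        if counts:  # skip levels with no observations
            relation, n = counts.most_common(1)[0]
            return relation, n / sum(counts.values())
    return None, 0.0


# Toy corpus-derived triples, invented for illustration only.
triples = [
    ("elephant", "has", "trunk"), ("elephant", "has", "trunk"),
    ("elephant", "uses", "trunk"),
    ("banana", "is", "yellow"), ("lemon", "is", "yellow"),
]
by_pair, by_feature, overall = build_counts(triples)

seen_pair = assign_relation("elephant", "trunk", by_pair, by_feature, overall)
unseen_pair = assign_relation("canary", "yellow", by_pair, by_feature, overall)
print(seen_pair, unseen_pair)
```

Here the pair "elephant"/"trunk" is resolved from pair-level counts, while the unseen pair "canary"/"yellow" backs off to the counts for the feature "yellow" alone.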

We also develop and implement an array of evaluations for our task. In addition to the previously employed ESSLLI gold standard, we offer five new evaluation techniques: fMRI activation prediction, EEG activation prediction, a conceptual structure statistics evaluation, a human-generated semantic similarity evaluation and a WordNet semantic similarity comparison. We also comprehensively evaluate our three best systems using human annotators.

Throughout our experiments, our various systems’ output is promising, but our final system is by far the best-performing. When evaluated against the ESSLLI gold standard it achieves a precision of 44.1%, compared to the 23.9% precision of the current state of the art. Furthermore, our final system’s Pearson correlation with human-generated semantic similarity measurements is strong at 0.742, and human judges marked 71.4% of its output as correct/plausible.
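For readers unfamiliar with the two headline measures, the sketch below shows how precision against a gold standard and a Pearson correlation are computed in general. It is a minimal illustration on invented data. The exact ESSLLI evaluation protocol (for example, how many extracted features per concept are scored) is not described in the abstract and is not reproduced here, and the function names are hypothetical.

```python
import math


def precision(extracted, gold):
    """Fraction of extracted (concept, feature) pairs that appear in the gold set."""
    extracted = set(extracted)
    if not extracted:
        return 0.0
    return len(extracted & set(gold)) / len(extracted)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy system output and gold standard, invented for illustration only.
system_output = [
    ("elephant", "trunk"), ("elephant", "grey"),
    ("banana", "yellow"), ("car", "banana"),
]
gold_standard = {("elephant", "trunk"), ("banana", "yellow"), ("car", "wheels")}
p = precision(system_output, gold_standard)

# Toy similarity scores: system-derived versus human-generated.
system_similarity = [1.0, 2.0, 3.0, 4.0]
human_similarity = [2.0, 4.0, 6.0, 8.0]
r = pearson(system_similarity, human_similarity)
print(p, r)
```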

Full text

PDF (1.0 MB)

BibTeX record

@TechReport{UCAM-CL-TR-839,
  author =      {Kelly, Colin},
  title =       {{Automatic extraction of property norm-like data from large text corpora}},
  year =        2013,
  month =       sep,
  url =         {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-839.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  number =      {UCAM-CL-TR-839}
}