Subcategorization Acquisition as an Evaluation Method for WSD

Task Description

This task involves evaluating word sense disambiguation (WSD) systems in the context of automatic subcategorization acquisition. We have shown in our previous work that accurate WSD can improve the performance of a verbal subcategorization acquisition system (Preiss and Korhonen 2002, Preiss et al. 2002). When the corpus data is disambiguated accurately, the system uses the correct sets of probability estimates for the acquisition process. This yields a more accurate subcategorization lexicon than the first sense heuristic (i.e. assuming the most frequent sense for all corpus instances and using only a single set of probability estimates for the acquisition process).
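The contrast between the two settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the acquisition system itself: the class labels, frame labels, probability values and the function name assign_estimates are all hypothetical.

```python
# Hypothetical per-class probability estimates over subcategorization frames.
# The class names, frame labels and numbers are placeholders for illustration.
ESTIMATES = {
    "class-A": {"NP": 0.7, "NP-PP": 0.3},
    "class-B": {"INTRANS": 0.6, "PP": 0.4},
}


def assign_estimates(instances, estimates, first_sense=None):
    """Pair each corpus instance with the set of estimates applied to it.

    instances:   list of (instance_id, sense_class) pairs.
    first_sense: if given, every instance is treated as having this class,
                 which mimics the first sense heuristic (a single set of
                 estimates for all instances).
    """
    assigned = {}
    for instance_id, sense_class in instances:
        chosen = first_sense if first_sense is not None else sense_class
        assigned[instance_id] = estimates[chosen]
    return assigned


instances = [(1, "class-A"), (2, "class-B"), (3, "class-A")]
with_wsd = assign_estimates(instances, ESTIMATES)
with_heuristic = assign_estimates(instances, ESTIMATES, first_sense="class-A")
```

With disambiguated input, instance 2 is paired with the class-B estimates; under the heuristic it receives the class-A estimates like every other instance.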

The task is restricted to a set of 29 verbs. These are "hard" verbs: high in frequency and with multiple senses. The participants will be given the list of verbs in advance to allow a training phase (no training data will be made available). We will provide the test corpus. This will contain around 1000 instances of each verb, which the participants will be expected to annotate with WordNet 1.7.1 senses.

Participants' answers will be submitted to us in a standardized format (the format specification, identical to that of the other English tasks, can be found here). After receiving the sense-annotated data, we will map the detected WordNet senses to our senses, which are based on broad Levin-style verb classes (Levin, 1993). Levin's notion of a sense is fairly broad but adequate for our purposes.
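The mapping step amounts to a many-to-one lookup from fine-grained WordNet senses to broad classes. The following is a minimal sketch, assuming a simple lookup table; the sense identifiers and class labels are placeholders rather than real WordNet 1.7.1 sense keys or our actual classes, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical mapping table. Keys and class labels are placeholders,
# not real WordNet 1.7.1 sense keys or the actual Levin-style classes.
WN_TO_LEVIN = {
    ("move", "wn-sense-1"): "levin-class-A",
    ("move", "wn-sense-2"): "levin-class-A",
    ("move", "wn-sense-3"): "levin-class-B",
}


def map_to_levin(verb, wn_sense, table=WN_TO_LEVIN):
    """Return the broad class for a fine-grained sense, or None if unmapped."""
    return table.get((verb, wn_sense))


def class_counts(annotations, table=WN_TO_LEVIN):
    """Count instances per (verb, broad class), skipping unmapped senses."""
    counts = Counter()
    for verb, wn_sense in annotations:
        levin = map_to_levin(verb, wn_sense, table)
        if levin is not None:
            counts[(verb, levin)] += 1
    return counts


annotations = [
    ("move", "wn-sense-1"),
    ("move", "wn-sense-2"),
    ("move", "wn-sense-3"),
    ("move", "wn-sense-unknown"),
]
counts = class_counts(annotations)
```

Because several WordNet senses can fall into the same broad class, a system that confuses two such senses is not penalised by this mapping.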

We will feed the sense-annotated data from each system to Anna Korhonen's subcategorization acquisition software. The more accurate the sense annotation, the more appropriate the probability estimates used in the acquisition process, and the more accurate we can expect the acquired subcategorization frames to be. The acquired frames will be evaluated against manually obtained gold standard frames, which will yield a ranking of the WSD systems.

Training/Test Data

No training data will be provided. Test data will consist of up to 1000 sentences for each chosen verb. These sentences will be drawn from the same corpus as that used for the creation of the gold standard.

Evaluation Methodology

Evaluation will consist of mapping the submitted WordNet 1.7.1 answers to Levin senses, and generating a set of subcategorization frames for each verb from each system. It will not be possible to evaluate systems if too few instances are annotated. The acquired subcategorization frame distributions will be evaluated against gold standard distributions created previously (by Anna Korhonen). Using the method described in Korhonen (2002), we will generate a ranking of the submitted WSD systems.
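To make the comparison of frame distributions concrete, here is a minimal sketch for a single verb. The measure used in Korhonen (2002) is not specified in this description, so the sketch substitutes Kullback-Leibler divergence with a small smoothing constant purely as an assumption; the frame labels, system names, probabilities and function names are likewise hypothetical.

```python
import math


def kl_divergence(gold, acquired, eps=1e-6):
    """KL divergence D(gold || acquired) over the union of frames.

    eps is added to every frame and both distributions are renormalised,
    so frames missing from one side do not cause division by zero.
    """
    frames = sorted(set(gold) | set(acquired))
    g = [gold.get(f, 0.0) + eps for f in frames]
    a = [acquired.get(f, 0.0) + eps for f in frames]
    g_total, a_total = sum(g), sum(a)
    return sum(
        (gi / g_total) * math.log((gi / g_total) / (ai / a_total))
        for gi, ai in zip(g, a)
    )


def rank_systems(gold, systems):
    """Return (system, score) pairs, closest to the gold standard first."""
    scores = {name: kl_divergence(gold, dist) for name, dist in systems.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical gold and acquired frame distributions for one verb.
gold = {"NP": 0.6, "NP-PP": 0.3, "INTRANS": 0.1}
systems = {
    "sys_a": {"NP": 0.55, "NP-PP": 0.35, "INTRANS": 0.1},
    "sys_b": {"NP": 0.2, "NP-PP": 0.2, "INTRANS": 0.6},
}
ranking = rank_systems(gold, systems)
```

In this toy example sys_a is ranked above sys_b because its acquired distribution is closer to the gold standard. A ranking over all verbs would need some aggregation of the per-verb scores, for example an average, which is again an assumption rather than the method described here.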

Resources

No subcategorization acquisition resources will be made directly available to participants. The test corpus can be obtained from the Senseval site, and the list of verbs used in this task is available through this site. Please let us know if there are any problems with either of these.

Bibliography: