VALEX - A Large Subcategorization
Lexicon for English Verbs
A BRIEF DESCRIPTION
VALEX is a new large valency (subcategorization) lexicon for English
verbs which is suitable for (statistical) natural language processing (NLP), linguistic and psycholinguistic
use. The lexicon was
developed by members of the Natural Language and
Information Processing Group
at the University of Cambridge
Computer Laboratory.
It is freely available for non-commercial research purposes.
VALEX includes subcategorization frame (SCF) and frequency information
for 6,397 English verbs.
It assumes a classification of 163 SCF types (Briscoe, 2000) -
a superset of those found in the ANLT and COMLEX Syntax dictionaries.
The SCFs
abstract over specific lexically-governed particles and prepositions and specific
predicate selectional preferences but include some derived
semi-predictable bounded dependency constructions, such
as particle and dative movement.
The lexicon provides a lexical entry
for each verb and SCF combination. It includes 212,741
entries in total, 33 per verb on average.
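The entry structure and the per-verb average can be illustrated with a minimal sketch. The field names (`verb`, `scf_id`, `count`), the toy verbs and the SCF numbers below are hypothetical and do not reflect the actual VALEX file format; only the two totals (212,741 entries and 6,397 verbs) come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """One verb/SCF combination (hypothetical layout, not the VALEX format)."""
    verb: str    # verb lemma
    scf_id: int  # SCF type number, assumed to lie in 1..163
    count: int   # assumed raw corpus frequency of this verb/SCF pair


def entries_per_verb(entries):
    """Return the mean number of distinct SCF entries per verb."""
    scfs_by_verb = defaultdict(set)
    for entry in entries:
        scfs_by_verb[entry.verb].add(entry.scf_id)
    return sum(len(scfs) for scfs in scfs_by_verb.values()) / len(scfs_by_verb)


# Toy data with made-up SCF numbers: 3 entries spread over 2 verbs.
toy_lexicon = [
    LexicalEntry("give", 24, 50),
    LexicalEntry("give", 37, 30),
    LexicalEntry("sleep", 22, 12),
]
toy_average = entries_per_verb(toy_lexicon)

# The same ratio computed from the totals reported for the full lexicon.
reported_average = 212_741 / 6_397
```

Dividing the reported totals gives roughly 33.3 entries per verb, which matches the rounded figure of 33 quoted above.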
VALEX differs from other existing
valency lexicons in the following ways:
- It was acquired automatically from five
large corpora (both British and American) and the Web. The
corpus data (consisting of 15.9M sentences in total) were processed using a recent version (Korhonen, 2002)
of the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997).
- Since the lexicon was acquired automatically, it contains some noise (i.e. incorrect SCF entries
and inaccurate frequencies).
Software is therefore provided with the lexicon which can be used to remove noise
from the lexicon, improve the quality of automatically acquired SCF distributions and/or
create sub-lexicons suitable for different purposes.
Four sub-lexicons (created by running the software with the best-performing options)
are also provided. They are more accurate than the basic lexicon and can be
readily employed for tasks that require better accuracy.
- The lexicon includes statistical information about the frequencies
and relative frequencies of SCFs in corpus data. This makes it particularly
suitable for statistical NLP use.
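As an illustration of the frequency information and noise filtering described in the list above, the following minimal sketch turns raw SCF counts for one verb into relative frequencies and drops frames below a cut-off. It is an assumption-laden example, not the software distributed with VALEX: the function names, the threshold value and the toy counts are all hypothetical, and simple thresholding is only one possible filtering strategy.

```python
def relative_frequencies(scf_counts):
    """Map each SCF id to its share of the verb's total SCF count."""
    total = sum(scf_counts.values())
    if total == 0:
        return {}
    return {scf: count / total for scf, count in scf_counts.items()}


def filter_by_threshold(scf_counts, threshold=0.02):
    """Drop SCFs whose relative frequency is below `threshold` (an assumed
    cut-off), then renormalise the remaining frames so they sum to 1."""
    rel = relative_frequencies(scf_counts)
    kept = {scf: count for scf, count in scf_counts.items()
            if rel[scf] >= threshold}
    return relative_frequencies(kept)


# Toy counts for a single verb; the SCF ids and numbers are made up.
counts = {24: 60, 37: 38, 112: 2}

rel = relative_frequencies(counts)                   # shares of 0.60, 0.38, 0.02
filtered = filter_by_threshold(counts, threshold=0.05)  # SCF 112 is dropped
```

Raising the threshold removes more low-frequency frames, which reduces noise but also risks discarding genuine rare SCFs, so the cut-off needs to be chosen for the task at hand.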
For a detailed description of the lexicon and how it was constructed, see:
Anna Korhonen, Yuval Krymolowski and Ted Briscoe. 2006.
A Large Subcategorization Lexicon for Natural Language Processing Applications.
To appear in the Proceedings of LREC. Genova, Italy.
PDF
DOWNLOAD, COPYRIGHT NOTICE AND FEEDBACK
The FIRST RELEASE of the lexicon: July 2006
We make the following materials available:
- The description of the 163 SCF types in the lexicon
- The large automatically acquired (unfiltered, noisy) subcategorization lexicon
- Software which can be used to filter out noisy SCFs
from the large lexicon, improve the quality of automatically acquired SCF distributions,
and build sub-lexicons suitable for different purposes
- Four sub-lexicons created using the
software which are more accurate than the basic noisy lexicon and which
can be readily employed by users who prefer not to run the software themselves
- Documentation which explains the different
sub-lexicon options provided by the software and evaluates their accuracy
A link to the VALEX licence and download page is
HERE.
Please read the copyright notice:
Copyright © 2006 Anna Korhonen and Ted Briscoe,
University of Cambridge
Please acknowledge the use of the lexicon and the related materials in any publications by
providing the appropriate reference and URL
(e.g. "VALEX (Korhonen, Krymolowski and Briscoe, 2006) is available at URL:
http://www.cl.cam.ac.uk/users/alk23/subcat/lexicon.html").
We would be pleased to receive comments on the materials provided here. Please
contact us with any feedback, questions or suggestions
you may have.
REFERENCES
Bran Boguraev and Ted Briscoe. 1987.
Large lexicons for natural language processing: utilising the grammar coding system of the
Longman Dictionary of Contemporary English.
In Computational Linguistics 13(3-4): 203-218.
PDF
Ted Briscoe. 2000. Dictionary and System
Subcategorisation Code Mappings. Unpublished manuscript,
University of Cambridge Computer Laboratory.
Included in the download materials above.
Ted Briscoe. 2001. From Dictionary to Corpus to Self-Organizing
Dictionary: Learning Valency Associations in the Face of Variation and Change.
In Proceedings of Corpus Linguistics. Lancaster University, UK.
PDF
Ted Briscoe and John Carroll. 1997. Automatic
Extraction of Subcategorization from Corpora.
In Proceedings of the Fifth Conference on
Applied Natural Language Processing. Washington, DC.
PS
Ralph Grishman, Catherine Macleod and Adam Meyers. 1994.
Comlex syntax: building a computational lexicon.
In Proceedings of the 15th International Conference on Computational Linguistics.
Kyoto, Japan.
PS
Anna Korhonen and Ted Briscoe. 2004.
Extended Lexical-Semantic Classification of English Verbs.
In Proceedings of the HLT/NAACL Workshop on Computational Lexical Semantics, Boston, MA.
PDF
Anna Korhonen, Genevieve Gorrell and Diana McCarthy. 2000. Statistical
Filtering and Subcategorization Frame Acquisition.
In Proceedings of the Joint SIGDAT Conference on
Empirical Methods in Natural Language Processing and Very Large Corpora. Hong Kong.
PS
Anna Korhonen. 2002. Subcategorization Acquisition.
PhD thesis published as Technical Report UCAM-CL-TR-530. Computer Laboratory, University of
Cambridge.
PDF
Anna Korhonen and Yuval Krymolowski. 2002. On the Robustness
of Entropy-Based Similarity Measures in Evaluation of Subcategorization Acquisition
Systems. In Proceedings of the Sixth Conference on Natural Language Learning.
Taipei, Taiwan.
PS
Anna Korhonen, Yuval Krymolowski and Ted Briscoe. 2006.
A Large Subcategorization Lexicon for Natural Language Processing Applications.
To appear in the Proceedings of LREC. Genova, Italy.
PDF
Anna Korhonen, Yuval Krymolowski and Zvika Marx. 2003. Clustering
Polysemic Subcategorization Frame Distributions Semantically.
In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
Sapporo, Japan. 64-71. PDF
Anna Korhonen and Judita Preiss. 2003. Improving Subcategorization
Acquisition using Word Sense Disambiguation.
In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
Sapporo, Japan. 48-55. PDF
Beth Levin. 1993. English Verb Classes and Alternations.
University of Chicago Press.