ACQUILEX (BRA-3030)
The Acquisition of Lexical Knowledge
An integrated European research project will develop techniques and
methodologies for utilising existing machine-readable dictionaries (MRDs) in
the construction of lexical components for natural language processing
(NLP) systems. The main focus of the project will be on extending existing
techniques for processing single MRDs in a monolingual (and currently
mostly English) context to the extraction of lexical information from
multiple MRD sources in a multilingual context, with the overall goal of
constructing a single multilingual lexical knowledge base. The techniques
developed will be either fully automated or automated to the extent that
they represent a significant saving in resources compared to the manual
construction of similar lexical components.
Approach and Methods
Machine-readable dictionaries are just one type of lexical resource; however,
recent research has demonstrated their potential utility for the rapid and
cost-effective construction of some aspects of the lexicons required by
NLP systems. Research on MRDs has both computational and linguistic aspects.
On the one hand, there is a need to develop advanced computational
techniques for modifying and exploring textual databases which are
specifically geared to the organisation of dictionaries and which are
capable of transcending the limitations of the conventional alphabetic
organisation of the printed version. On the other hand, the insights of
theoretical linguistics and artificial intelligence are essential to develop
adequate and general-purpose representations for lexical systems across
languages and to ensure that this information is usable by a wide variety
of NLP systems.
The long-term goal of our research is the development of a multilingual
knowledge base containing the most general and domain-independent aspects
of lexical knowledge represented in a fashion which makes it maximally
reusable. The knowledge base will be rooted in a common conceptual/semantic
structure which is linked to, and defines, the individual word senses of
the languages covered and which is rich enough to be able to support a
`deep' knowledge-intensive model of language processing. The knowledge
base will contain substantial general vocabulary with associated
phonological, morphological, syntactic and semantic/pragmatic information
capable of deployment in the lexical components of a wide variety of
practical NLP systems.
The project will explore the feasibility of this long-term goal by further
developing techniques for the (semi-)automated extraction of information
from monolingual MRDs for English, Italian and Dutch and bilingual MRDs
for English-Italian and English-Dutch. These MRDs will be loaded into a
standardised lexical database system used by all the project partners,
and research will be undertaken to enable the linking, comparison and
merging of information between separate dictionaries and the construction
of improved derived lexicons which factor out some of the unreliability
in individual MRD sources. Separate semantic taxonomies will be derived
from monolingual dictionaries for English, Italian and Dutch for a
common subset of vocabulary, and these taxonomies will be merged to create an
integrated lexical knowledge base prototype.
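
The merging step described above can be illustrated with a minimal
sketch. It assumes that each monolingual taxonomy is available as a simple
word-to-hypernym mapping and that links from words to shared concept
identifiers already exist (for instance, derived from the bilingual MRDs).
The function name, the data format and the toy entries are all hypothetical
and are not taken from the project's actual tools.

```python
from collections import defaultdict


def merge_taxonomies(taxonomies, concept_of):
    """Merge per-language IS-A links into one concept-level structure.

    taxonomies: {language: {word: hypernym}}    (hypothetical input format)
    concept_of: {(language, word): concept_id}  (hypothetical sense links)
    """
    merged = defaultdict(lambda: {"words": set(), "parents": set()})
    for lang, is_a in taxonomies.items():
        for word, hypernym in is_a.items():
            child = concept_of.get((lang, word))
            parent = concept_of.get((lang, hypernym))
            if child is None:
                continue  # unlinked word: left out of the merged structure
            merged[child]["words"].add((lang, word))
            if parent is not None:
                merged[parent]["words"].add((lang, hypernym))
                if parent != child:
                    merged[child]["parents"].add(parent)
    return dict(merged)


# Toy data, invented for illustration only.
taxonomies = {
    "en": {"dog": "animal", "cat": "animal"},
    "it": {"cane": "animale", "gatto": "animale"},
    "nl": {"hond": "dier"},
}
concept_of = {
    ("en", "dog"): "DOG", ("it", "cane"): "DOG", ("nl", "hond"): "DOG",
    ("en", "cat"): "CAT", ("it", "gatto"): "CAT",
    ("en", "animal"): "ANIMAL", ("it", "animale"): "ANIMAL",
    ("nl", "dier"): "ANIMAL",
}
merged = merge_taxonomies(taxonomies, concept_of)
```

In this sketch each shared concept collects the word senses that realise it
in each language, together with the parent concepts inherited from the
monolingual taxonomies. A realistic merge would also have to deal with
conflicting or missing links between sources, which the sketch ignores.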
The utility of the lexical knowledge base prototype and of this approach
to the construction of realistic lexical components for practical NLP
systems will be evaluated by assessing its ability to support the
lexical requirements of an illustrative prototype multilingual NLP system
capable of analysing a subset of English and generating Dutch or Italian
through a common representation.
ACQUILEX-II (Esprit Project 7315)
The Acquisition of Lexical Knowledge
The project will build on and extend the results achieved in ACQUILEX
by continuing research on theoretical issues in the design of lexicons
and constructing further and more substantial monolingual and
multilingual knowledge base fragments on the basis of a mixture of
MRDs and manual encoding.
At the same time, the project intends to make considerable use of corpora as a
further source of data for the semi-automatic construction of lexical
resources. Substantial quantities of textual and transcribed spoken corpora are
rapidly becoming available within the academic and dictionary
publishing communities. Whilst MRDs provide a highly structured and
focussed source of lexical data, substantial corpora can supplement
this with information concerning usage, frequency, and so
forth. The proposed project will develop the software tools required
to enable efficient use of corpora and will utilise them in the
development of dictionary databases and the lexical knowledge base.
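
As a rough illustration of how corpus data might supplement a dictionary
database, the following sketch attaches simple frequency figures to a list
of headwords. It is an assumption-laden toy: the function name, the
tokenisation rule and the sample text are hypothetical, and it counts
surface word forms only, with no lemmatisation or sense distinction.

```python
import re
from collections import Counter


def attach_frequencies(headwords, corpus_text):
    """Return raw and per-million corpus frequencies for each headword."""
    # Naive tokenisation: lower-cased runs of ASCII letters (an assumption).
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: {
            "count": counts[word],
            "per_million": 1e6 * counts[word] / total if total else 0.0,
        }
        for word in headwords
    }


# Toy corpus, invented for illustration only.
sample_corpus = "The dog chased the cat. The cat ran."
frequencies = attach_frequencies(["dog", "cat", "unicorn"], sample_corpus)
```

Headwords that never occur in the corpus receive a count of zero, which is
itself potentially useful information about usage.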
In addition, in the new project we plan to tap the expertise of professional
lexicographers and of the dictionary publishing industry far more directly,
both in the investigation of theoretical issues and in transferring the tools,
techniques and insights of computational lexicology and lexicography
to that community.
Approach and Methods
Work on ACQUILEX can be divided into two areas: the development of a
methodology and the construction of software tools to create lexical
databases from MRDs; and the subsequent construction of illustrative,
theoretically motivated lexical knowledge base fragments from these
databases, using a further set of software tools designed to integrate,
enrich and formalise the database information.
The emphasis of effort in ACQUILEX was on the
development of lexical databases from MRDs and the design and
implementation of a lexical representation language to underpin the
lexical knowledge base. In the proposed project, the emphasis will be
on the exploitation of these databases and the lexical knowledge base
framework in the construction of further and more substantial lexicon
fragments and on the investigation of theoretical issues in lexicon
design within the context of the unique research environment provided
by the lexical databases and analysed corpora.
Potential
The project aims to foster productive collaboration between the
computational linguistic and lexicographical communities. We envisage
that the proposed research and training activities will contribute to
improvements in the quality of dictionaries (particularly bi/multilingual
and learners' dictionaries) and in the productivity of the dictionary
publishing industry, and will provide impetus to electronic publishing
initiatives, in addition to serving the central goal of producing a
multilingual lexical knowledge base which can be deployed in habitable
and practical natural language processing applications.