The Acquisition of Lexical Knowledge

An integrated European research project will develop techniques and methodologies for utilising existing machine-readable dictionaries (MRDs) in the construction of lexical components for natural language processing (NLP) systems. The main focus of the project will be on extending existing techniques for processing single MRDs in a monolingual (and currently mostly English) context to the extraction of lexical information from multiple MRD sources in a multilingual context with the overall goal of constructing a single multilingual lexical knowledge base. The techniques developed will be either fully automated or automated to the extent that they represent a significant saving in resources compared to the manual construction of similar lexical components.

Approach and Methods

Machine-readable dictionaries are just one type of lexical resource; however, recent research has demonstrated their potential utility for the rapid and cost-effective construction of some aspects of the lexicons required by NLP systems. Research on MRDs has both computational and linguistic aspects. On the one hand, advanced computational techniques for modifying and exploring textual databases need to be developed, which are specifically geared to the organisation of dictionaries and which are capable of transcending the limitations of the conventional alphabetic organisation of the printed version. On the other hand, the insights of theoretical linguistics and artificial intelligence are essential in order to develop adequate and general-purpose representations for lexical systems across languages and to ensure that this information is usable by a wide variety of NLP systems.

The long-term goal of our research is the development of a multilingual knowledge base containing the most general and domain-independent aspects of lexical knowledge represented in a fashion which makes it maximally reusable. The knowledge base will be rooted in a common conceptual/semantic structure which is linked to, and defines, the individual word senses of the languages covered and which is rich enough to be able to support a 'deep' knowledge-intensive model of language processing. The knowledge base will contain substantial general vocabulary with associated phonological, morphological, syntactic and semantic/pragmatic information capable of deployment in the lexical components of a wide variety of practical NLP systems.

The project will explore the feasibility of this long-term goal by further developing techniques for the (semi-)automated extraction of information from monolingual MRDs for English, Italian and Dutch and bilingual MRDs for English-Italian and English-Dutch. These MRDs will be loaded into a standardised lexical database system used by all the project partners and research will be undertaken to enable the linking, comparison and merging of information between separate dictionaries and the construction of improved derived lexicons which factor out some of the unreliability in individual MRD sources. Separate semantic taxonomies will be derived from monolingual dictionaries for English, Italian and Dutch for a common subset of vocabulary and these taxonomies will be merged to create an integrated lexical knowledge base prototype.
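As a minimal sketch of the merging step described above, the following Python fragment maps a taxonomy derived from one monolingual dictionary onto another through bilingual dictionary links. It is an illustration only and not the project's software: the toy taxonomies, the link table and the function name merge_taxonomies are all hypothetical, and real MRD data would need sense disambiguation before links of this kind could be trusted.

```python
from collections import defaultdict

# Toy data, invented for illustration only (not drawn from any real MRD).
# Each taxonomy maps a word sense to its hypernym (the genus term of its definition).
EN_TAXONOMY = {"dog": "animal", "cat": "animal", "animal": "organism"}
IT_TAXONOMY = {"cane": "animale", "gatto": "animale", "animale": "organismo"}
# Hypothetical bilingual links, as might be read off an English-Italian MRD.
IT_TO_EN = {"cane": "dog", "gatto": "cat", "animale": "animal", "organismo": "organism"}


def merge_taxonomies(base, other, links, base_lang="en", other_lang="it"):
    """Merge `other` into `base`, using `links` to map its words onto base concepts.

    Returns the merged concept -> hypernym map, the words of each language
    attached to every concept, and a list of conflicting hypernym links.
    """
    merged = dict(base)
    lexicalisations = defaultdict(set)
    for child, parent in base.items():
        lexicalisations[child].add((base_lang, child))
        lexicalisations[parent].add((base_lang, parent))

    conflicts = []
    for child, parent in other.items():
        concept, hypernym = links.get(child), links.get(parent)
        if concept is None or hypernym is None:
            continue  # no bilingual link available: leave for manual inspection
        lexicalisations[concept].add((other_lang, child))
        lexicalisations[hypernym].add((other_lang, parent))
        if merged.get(concept, hypernym) != hypernym:
            # The two dictionaries disagree about the hypernym of this concept.
            conflicts.append((concept, merged[concept], hypernym))
        else:
            merged[concept] = hypernym
    return merged, dict(lexicalisations), conflicts


merged, words, conflicts = merge_taxonomies(EN_TAXONOMY, IT_TAXONOMY, IT_TO_EN)
```

Recording disagreements in a separate list, rather than silently overwriting one source with the other, reflects the aim of factoring out unreliability in individual sources: a conflict marks an entry that needs comparison or manual checking.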

The utility of the lexical knowledge base prototype, and of this approach to the construction of realistic lexical components for practical NLP systems, will be evaluated by assessing whether it can support the lexical requirements of an illustrative prototype multilingual NLP system capable of analysing a subset of English and generating Dutch or Italian through a common representation.

ACQUILEX-II (Esprit Project 7315)
The Acquisition of Lexical Knowledge

The project will build on and extend the results achieved in ACQUILEX by continuing research on theoretical issues in the design of lexicons and constructing further and more substantial monolingual and multilingual knowledge base fragments on the basis of a mixture of MRDs and manual encoding. At the same time, the project intends to make considerable use of corpora as a further source of data for the semi-automatic construction of lexical resources. Substantial quantities of textual and spoken-transcribed corpora are rapidly becoming available within the academic and dictionary publishing communities. Whilst MRDs provide a highly-structured and focussed source of lexical data, substantial corpora can supplement this with information concerning usage, frequency, and so forth. The proposed project will develop the software tools required to enable efficient use of corpora and will utilise them in the development of dictionary databases and the lexical knowledge base. In addition, we plan to tap the expertise of professional lexicographers and of the dictionary publishing industry far more directly in the new project, both in the investigation of theoretical issues and in transferring the tools, techniques and insights of computational lexicology and lexicography to that community.
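The way in which a corpus can supplement dictionary-derived entries with frequency information can be sketched as follows. This is a deliberately simple illustration under stated assumptions: the entries, the corpus text and the function add_frequencies are hypothetical, and the counting is over raw word forms with no lemmatisation, so inflected forms such as "banks" are not credited to the headword "bank".

```python
import re
from collections import Counter

# Hypothetical lexical database entries keyed by headword (illustrative only).
ENTRIES = {"bank": {"pos": "noun"}, "run": {"pos": "verb"}}
# Hypothetical corpus sample.
CORPUS = "The bank will run the scheme. Banks run many schemes; the bank is open."


def add_frequencies(entries, text):
    """Return a copy of `entries` with a raw corpus frequency added to each entry."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {hw: {**info, "freq": counts[hw]} for hw, info in entries.items()}


enriched = add_frequencies(ENTRIES, CORPUS)
```

A realistic tool would count over lemmatised or otherwise analysed corpora rather than raw forms.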

Approach and Methods

Work on ACQUILEX can be divided into two areas: first, the development of a methodology and the construction of software tools to create lexical databases from MRDs; and second, the subsequent construction of illustrative, theoretically-motivated lexical knowledge base fragments from these databases, using a further set of software tools designed to integrate, enrich and formalise the database information. The emphasis of effort in ACQUILEX was on the development of lexical databases from MRDs and the design and implementation of a lexical representation language to underpin the lexical knowledge base. In the proposed project, the emphasis will be on the exploitation of these databases and the lexical knowledge base framework in the construction of further and more substantial lexicon fragments, and on the investigation of theoretical issues in lexicon design within the context of the unique research environment provided by the lexical databases and analysed corpora.


The project aims to foster productive collaboration between the computational linguistic and lexicographical communities. We envisage that the proposed research and training activities will contribute to improvements in the quality of dictionaries (particularly bilingual, multilingual and learners' dictionaries) and in the productivity of the dictionary publishing industry, and will provide impetus to electronic publishing initiatives, in addition to serving the central goal of producing a multilingual lexical knowledge base which can be deployed in habitable and practical natural language processing applications.