The Cambridge/Acquilex Lexical Database System

Introduction

The Lexical Database System (LDB) is a computer system which provides flexible access to machine-readable dictionaries. The LDB was developed at the University of Cambridge Computer Laboratory as part of the EU ESPRIT ACQUILEX project. It supports a user in formulating queries to retrieve subsets of entries from one or more dictionaries, implements the efficient retrieval of entries, and allows new 'derived' dictionaries to be created containing entries from a source dictionary, augmented and enriched with new information.

The LDB is currently in use on the ACQUILEX project to access a number of dictionaries, including the Longman Dictionary of Contemporary English 1st edition (see below), the MRC Psycholinguistic Database, Van Dale Dutch monolingual and bilingual dictionaries, and Vox Spanish dictionaries. The LDB was developed independently of any particular dictionary publisher, so to use it the software must be licensed from Cambridge University and a compatible dictionary (e.g. one of the above) obtained directly from the appropriate publisher.

The LDB software can be licensed on either a research or a commercial basis. Contact John Carroll (jac@cl.cam.ac.uk) if you are interested. Outline terms:

- research: fee 500 ECU or local currency equivalent; to be used solely for research purposes, its use acknowledged in any relevant publications, and not disseminated further;
- commercial: fee 3000 ECU or local currency equivalent; no IPR restrictions (but again no further dissemination).

The LDB is written entirely in Common Lisp and will run on any machine which has an implementation of this language. There are implementations of Common Lisp for most UNIX machines, the Apple Macintosh family, and IBM PC-compatibles running Windows 3 or OS/2. To date the LDB has been tested under Austin Kyoto Common Lisp, Lucid CL, Franz Allegro CL and CL\PC, and Procyon CL. At present, it supports a graphical interface only when running under Allegro CL\PC or Procyon (on either the Macintosh or PC), or under Franz Allegro with Common Windows (on UNIX); however, a textual TTY-style interface with the same functionality is available in all implementations. The LDB is supplied in source code form and requires no further software packages (apart from a Lisp implementation).

Background

In 'Database Models for Computational Linguistics' (EURALEX '90 Proceedings, Biblograf, 1992), B. Boguraev, E. Briscoe, J. Carroll and A. Copestake identify four classes of dictionary models. The first follows the well-established notion of relational databases, mapping dictionary entries into a set of tables. Although this relational model of the lexicon can take advantage of established database technology, it is generally agreed to be a poor fit for dictionaries, given the intricate nature of lexical data and the subtle interactions within it. The second class is the hierarchical model, which employs a structured representation to encode the complex structural relationships between the fields of entries (exploiting the insight that dictionary entries can naturally be regarded as shallow hierarchies with an indefinite number of attributes at each level). The third class is the tagged model. In contrast to the hierarchical model, which fails to preserve the visual, human-readable interrelationships amongst the contents of dictionary entries, the tagged model emphasises preserving all of the information associated with the original printed form of the dictionary entry, but in the process fails to offer a natural way of making explicit statements about the implicit structural relationships of the elements within the entry.

The fourth class of dictionary model, the two-level model, aims at combining the advantages of the hierarchical and tagged models. This is the model implemented in the LDB. In the two-level model, the source dictionary is the primary repository of lexical data, and, separately from the dictionary source, sets of interrelated indices encode all statements about the structure and content of the data held in the dictionary. In the LDB, the 'mounting' of a new machine-readable dictionary consists mainly of defining what these indices are, how they are to be extracted from entries, and then telling the system to create permanent files on disc holding the indices. In fact, two types of indices are created: one type on the contents of headword fields (and also optionally on internal entry sequencing information on the typesetting tape), enabling access to entries via their headwords (similar to the traditional way of using printed dictionaries); the other type based on the contents of entries, allowing the dictionary to be queried and entries to be retrieved from it on the basis of elements and their relationships within entries, rather than just by headword.

Querying a Dictionary

A query consists of a hierarchical collection of attributes with associated values; for example, the query

[[syn [gcode T1]] [sem [word show]]]

has two attributes at the top level: 'syn' and 'sem'; the attribute 'gcode' is beneath 'syn' with value 'T1', and 'word' beneath 'sem' with value 'show'. The user can construct and modify such queries interactively (the next section outlines the extensive facilities provided by the system), and then ask the system to retrieve all the entries which satisfy it. Several dictionaries can be loaded in the same session: they are all available for access concurrently.
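To make the structure of such a query concrete, the following is a minimal sketch in Python (not the LDB's own Common Lisp code) that reads the bracketed notation into nested lists and flattens it into attribute paths with their values. The function names `parse_query` and `flatten` are hypothetical, and the sketch assumes that tokens are separated by whitespace and brackets only.

```python
def parse_query(text):
    """Parse a bracketed query string into nested Python lists."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "]":
            raise ValueError("unexpected ']'")
        if token != "[":
            return token  # atomic attribute name or value
        items = []
        while pos < len(tokens) and tokens[pos] != "]":
            items.append(parse_node())
        if pos >= len(tokens):
            raise ValueError("missing ']'")
        pos += 1  # consume the closing bracket
        return items

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing text after query")
    return tree


def flatten(specs, prefix=()):
    """Turn a list of [name, child...] specs into (attribute path, value) pairs."""
    pairs = []
    for spec in specs:
        name, *children = spec
        for child in children:
            if isinstance(child, list):
                # A nested attribute: descend with the extended path.
                pairs.extend(flatten([child], prefix + (name,)))
            else:
                # An atomic value attached to this attribute.
                pairs.append((prefix + (name,), child))
    return pairs


query = parse_query("[[syn [gcode T1]] [sem [word show]]]")
for path, value in flatten(query):
    print(".".join(path), "=", value)
```

For the example query this yields the two paths syn.gcode and sem.word, each paired with its value.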

Looking a query up is a two-stage process. The LDB first maps the query onto a collection of indices, determines which of these are the most discriminating (i.e. have the lowest frequency, based on statistics which were gathered during the creation of the index files), and finds an initial set of entries which satisfy this subset of indices by computing the intersection of the pointers from the index files to entries in the dictionary corresponding to the indices. The LDB then retrieves this set of entries from the source dictionary, checks which ones satisfy the rest of the (less discriminating) indices, and returns the ones which do as the final result. Crucial to the efficiency of dictionary query lookup is a good partition of the indices in the query into those used to form the initial candidate set of entries and those used to check these entries after they have been retrieved. The LDB bases its partition on estimates of the relative costs of reading and intersecting entry pointers versus reading and checking the entries themselves, on the numbers of pointers that will be read, and the expected probability of an entry succeeding in a check against a particular index.
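The two stages can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the LDB's algorithm: index keys are plain strings, the index files are an in-memory dictionary, and the partition is a fixed count (the hypothetical parameter `max_index_reads`) instead of the cost estimates described above. The entries are invented toy data.

```python
def build_index(entries):
    """Map each index key to the set of entry ids (pointers) containing it."""
    index = {}
    for entry_id, keys in entries.items():
        for key in keys:
            index.setdefault(key, set()).add(entry_id)
    return index


def lookup(query_keys, index, entries, max_index_reads=1):
    """Two-stage lookup: intersect the rarest indices, then check the rest."""
    # Rank the query's indices by frequency, most discriminating first.
    ranked = sorted(query_keys, key=lambda key: len(index.get(key, ())))
    selective, remaining = ranked[:max_index_reads], ranked[max_index_reads:]

    # Stage 1: intersect the pointer sets of the most discriminating indices.
    candidates = None
    for key in selective:
        pointers = index.get(key, set())
        candidates = set(pointers) if candidates is None else candidates & pointers
    if candidates is None:  # an empty query matches every entry
        candidates = set(entries)

    # Stage 2: retrieve each candidate entry and check the remaining indices.
    return [entry_id for entry_id in sorted(candidates)
            if all(key in entries[entry_id] for key in remaining)]


# Invented toy entries: entry id -> index keys extracted from that entry.
ENTRIES = {
    1: {"syn.gcode=T1", "sem.word=show"},
    2: {"syn.gcode=T1", "sem.word=give"},
    3: {"syn.gcode=I0", "sem.word=show"},
    4: {"syn.gcode=T1", "sem.word=make"},
}
INDEX = build_index(ENTRIES)
print(lookup(["syn.gcode=T1", "sem.word=show"], INDEX, ENTRIES))
```

Here 'sem.word=show' points to fewer entries than 'syn.gcode=T1', so it forms the candidate set, and only those candidates are then checked against the grammar code.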

When looking up a query, the LDB, by default, computes the answers in a sense-based (rather than an entry-based) fashion: that is, it returns just the senses which satisfy the query, not the whole entry (unless of course all the senses in the entry satisfy it). The LDB offers a number of options for the display of answers to a query: after informing the user of how many entries and senses satisfied the query, the LDB can be asked to display the headwords of all the results, of a sample of them, of the first one, or nothing. In addition, a user-defined option allows any portion of result entries to be displayed rather than simply the headword, and results can also be returned as entry pointers to allow arbitrary set operations (e.g. union, intersection, set difference) to be applied to them.
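The difference between sense-based and entry-based answers, and the use of set operations on results, can be sketched as follows. This is an illustration in Python only: the (headword, sense number) pairs stand in for the LDB's entry pointers, the function names are hypothetical, and the assignment of grammar codes to words is invented toy data.

```python
# Invented toy dictionary: headword -> sense number -> attribute values.
DICTIONARY = {
    "show": {1: {"gcode": "T1"}, 2: {"gcode": "I0"}},
    "give": {1: {"gcode": "T1"}, 2: {"gcode": "D1"}},
    "sleep": {1: {"gcode": "I0"}},
}


def matching_senses(dictionary, attribute, value):
    """Sense-based answer: the (headword, sense) pairs satisfying the query."""
    return {(headword, number)
            for headword, senses in dictionary.items()
            for number, features in senses.items()
            if features.get(attribute) == value}


def whole_entries(dictionary, senses):
    """Headwords for which every sense of the entry is in the result set."""
    return {headword for headword, entry in dictionary.items()
            if all((headword, number) in senses for number in entry)}


transitive = matching_senses(DICTIONARY, "gcode", "T1")
intransitive = matching_senses(DICTIONARY, "gcode", "I0")

either = transitive | intransitive      # union
both = transitive & intransitive        # intersection
only_transitive = either - intransitive  # set difference

print(sorted(either))
print(sorted(whole_entries(DICTIONARY, either)))
```

Because results are plain sets of pointers, they can be combined freely before any entry text is displayed.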

Derived Dictionaries

As well as supporting several dictionaries being available for access concurrently, the LDB allows the user to apply a single query to two or more dictionaries simultaneously, as long as all of the dictionaries concerned are derived from a single 'source' dictionary. (In fact, one of the dictionaries may be the source itself.) A derived dictionary will typically consist of an elaboration of a subset of the information in the source dictionary: for example, it might contain just the definition part of entries in the source, but with parsed representations of the definitions. The LDB also makes it straightforward to create derived dictionaries based on the entries returned from the lookup of a query.

Formulating Queries

As mentioned above, a query is represented as a hierarchical collection of attributes with associated values. The basic values for attributes are usually atomic tokens (e.g. numbers, dictionary codes, or words) as in the example above. More complex types of value may be made out of these basic values, however, in order to express conjunctive and disjunctive queries and (atomic) negation. Basic values can be wildcarded with '?', matching any single sub-element, and '*' matching any sequence of sub-elements.
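The wildcard behaviour described above can be sketched in Python as follows. The sketch assumes that the sub-elements of a basic value are its individual characters, which the text does not state, and `wildcard_match` is a hypothetical name rather than an LDB function.

```python
def wildcard_match(pattern, value):
    """Return True if value matches pattern.

    '?' matches exactly one sub-element and '*' matches any sequence of
    sub-elements, including the empty sequence.
    """
    p = v = 0
    star = -1  # position of the most recent '*' in the pattern
    mark = 0   # position in value at which that '*' started matching
    while v < len(value):
        if p < len(pattern) and pattern[p] == "*":
            star, mark = p, v
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", value[v]):
            p += 1
            v += 1
        elif star != -1:
            # Backtrack: let the last '*' absorb one more sub-element.
            mark += 1
            p, v = star + 1, mark
        else:
            return False
    # Any '*' left at the end of the pattern can match the empty sequence.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


for pattern, value in [("T?", "T1"), ("T?", "T"), ("*ing", "showing"), ("s*w", "show")]:
    print(pattern, value, wildcard_match(pattern, value))
```

The same function works on lists of tokens as well as strings, since it only indexes and compares sub-elements.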

In addition to the attributes defined during the mounting of a dictionary, the LDB itself provides two more, called 'headword' and 'constr' (short for constraint). 'Headword' allows the headword of the entries to be retrieved to be specified (partially, if wildcards are used). 'Constr' provides a way of expressing queries which cannot be formulated in a simple attribute-value form. The attribute can have one or more values, which are interpreted conjunctively; each value is either a disjunction of a set of dictionary index specifications or calls to Lisp, or the negation of one.

The LDB gives the user the option of using either a TTY or a graphical interface (the latter only when running on top of Procyon CL, Franz Allegro CL with Common Windows, or Franz Allegro CL\PC) for constructing queries, modifying them, looking them up, reading them from file, and saving them back to disc. Both interfaces provide full facilities for the quick and accurate manipulation of queries. The graphical interface makes extensive use of the mouse, windows, and pop-up and pull-down menus.

Using the LDB with the Longman Dictionary of Contemporary English

In the Longman Dictionary of Contemporary English (LDOCE), attribute names form a hierarchy, with 'syn' (syntax), 'sem' (semantics) and 'pron' (pronunciation) at the top. The LDOCE attribute name hierarchy is as follows:

syn
- cat: syntactic category
- c1: first compound field
- c2: second compound field
- gcode: grammar code (encoding complementation etc.)
- label: label field
- multiple: number of words in headword field, and whether hyphenated

sem
- antonym: from the definition field
- box: box codes (encoding selectional restrictions and other information)
- defn: 'implicit' x-refs and 'defining word' stems in definition
- order: order of words in definition
- subj: subject codes
- synonym: taken from the definition field
- word: actual words (excluding cross-references) in definition
- xref: 'explicit' x-refs in definition

pron
- nsylls: number of syllables
- s1, s2, s3, etc., where si is the ith syllable
  - stress: stress for this syllable
  - onset: phonemes at the syllable onset
  - peak: phonemes at the syllable peak
  - coda: phonemes at the syllable coda
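As an illustration of how this hierarchy constrains queries, the following Python sketch encodes the attribute names listed above and checks whether an attribute path is one of them. The names `LDOCE_ATTRIBUTES` and `is_valid_path` are hypothetical, and the sketch assumes that only complete paths down to a leaf attribute are accepted and that syllables are numbered from 1.

```python
import re

# Attribute names taken from the hierarchy above.
LDOCE_ATTRIBUTES = {
    "syn": {"cat", "c1", "c2", "gcode", "label", "multiple"},
    "sem": {"antonym", "box", "defn", "order", "subj", "synonym", "word", "xref"},
}
SYLLABLE_PARTS = {"stress", "onset", "peak", "coda"}
SYLLABLE_NAME = re.compile(r"s[1-9][0-9]*")  # s1, s2, s3, ...


def is_valid_path(path):
    """Check a tuple of attribute names against the LDOCE hierarchy."""
    if len(path) == 2 and path[0] in LDOCE_ATTRIBUTES:
        return path[1] in LDOCE_ATTRIBUTES[path[0]]
    if path and path[0] == "pron":
        if len(path) == 2:
            return path[1] == "nsylls"
        if len(path) == 3:
            return (SYLLABLE_NAME.fullmatch(path[1]) is not None
                    and path[2] in SYLLABLE_PARTS)
    return False


for path in [("syn", "gcode"), ("pron", "s2", "onset"), ("sem", "gcode")]:
    print(".".join(path), is_valid_path(path))
```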