The Cambridge/Acquilex Lexical Database System
Introduction
The Lexical Database System (LDB) is a computer system which provides
flexible access to machine-readable dictionaries. The LDB was
developed at the University of Cambridge Computer Laboratory as part
of the EU ESPRIT ACQUILEX project. It supports a user in
formulating queries to retrieve subsets of entries from one or more
dictionaries, implements the efficient retrieval of entries, and
allows new 'derived' dictionaries to be created
containing entries from a source dictionary, augmented and enriched
with new information.
The LDB is currently in use on the ACQUILEX project to access a number of
dictionaries - including the Longman Dictionary of Contemporary English 1st
edition (see below), the MRC Psycholinguistic Database, Van
Dale Dutch monolingual and bilingual dictionaries, and Vox Spanish dictionaries.
The LDB was developed independently of any particular dictionary publisher. To
use it, the software must therefore be licensed from Cambridge University, and a
compatible dictionary (e.g. one of the above) obtained directly from the
appropriate publisher.
The LDB software can be licensed on either a research or a commercial basis.
Contact John Carroll (jac@cl.cam.ac.uk) if you are interested. Outline terms:
- research: fee 500 ECU or local currency equivalent - to be used
solely for research purposes, its use acknowledged in any relevant
publications, and not disseminated further;
- commercial: fee 3000 ECU or local currency equivalent - no IPR
restrictions (but again no further dissemination).
The LDB is written entirely in Common Lisp and will run on any machine
which has an implementation of this language. There are implementations
of Common Lisp for most UNIX machines, the Apple Macintosh family, and
IBM PC-compatibles running Windows 3 or OS/2. To date the LDB has been
tested under Austin Kyoto Common Lisp, Lucid CL, Franz Allegro CL and
CL\PC, and Procyon CL. At present, it supports a graphical interface
only when running under Allegro CL\PC or Procyon (on either the
Macintosh or PC), or under Franz Allegro with Common Windows (on UNIX);
however, a textual TTY-style interface with the same functionality is
available in all implementations. The LDB is supplied in source code
form and requires no further software packages (apart from
a Lisp implementation).
Background
In 'Database Models for Computational Linguistics' (EURALEX '90
Proceedings, Biblograf, 1992), B. Boguraev, E. Briscoe, J.
Carroll and A. Copestake identify four classes of dictionary models.
The first of these follows the well-established notion of relational
databases, mapping dictionary entries into a set of tables. Although
this relational model of the lexicon can take advantage of established
database technology, it is generally agreed to be a poor target for
dictionary data, given the intricate nature of, and subtle
interactions within, lexical data. The second class is the hierarchical
model which employs a structured representation to encode the complex
structural relationships between the fields of entries (exploiting the
insight that dictionary entries can naturally be regarded as shallow
hierarchies with an indefinite number of attributes at each level). The
third class is the tagged model; in contrast to the hierarchical model,
which fails to preserve the visual, human-readable interrelationships
amongst the contents of dictionary entries, this model places the
emphasis on preserving all of the information associated with the
original printed form of the dictionary entry, but in the process fails
to offer a natural way of making explicit statements concerning the
implicit structural relationships of the elements within the entry.
The fourth class of dictionary model, the two-level model, aims at
combining the advantages of the hierarchical and tagged models. This is
the model implemented in the LDB. In the two-level model, the source
dictionary is the primary repository of lexical data, and, separately
from the dictionary source, sets of interrelated indices encode all
statements about the structure and content of the data held in the
dictionary. In the LDB, the 'mounting' of a new machine-readable
dictionary consists mainly of defining what these indices are, how
they are to be extracted from entries, and then telling the system to
create permanent files on disc holding the indices. In fact, two types
of indices are created: one type on the contents of headword fields
(and also optionally on internal entry sequencing information on the
typesetting tape), enabling access to entries via their headwords
(similar to the traditional way of using printed dictionaries); the
other type based on the contents of entries, allowing the dictionary to
be queried and entries to be retrieved from it on the basis of elements
and their relationships within entries, rather than just by headword.
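As a rough illustration of the two index types, the following Python sketch
builds a headword index and a content index over a handful of in-memory
entries. The entry layout, the build_indices function and the use of list
positions as entry pointers are all assumptions made for the example; the LDB
itself is written in Common Lisp and keeps its indices in files on disc.

```python
from collections import defaultdict

# Toy entries invented for illustration; list positions stand in for entry pointers.
ENTRIES = [
    {"headword": "show", "gcode": ["T1", "D1"], "word": ["cause", "see"]},
    {"headword": "display", "gcode": ["T1"], "word": ["show", "public"]},
    {"headword": "sleep", "gcode": ["I0"], "word": ["rest", "night"]},
]


def build_indices(entries, content_attributes=("gcode", "word")):
    """Return (headword_index, content_index), each mapping a key to entry pointers."""
    headword_index = defaultdict(set)
    content_index = defaultdict(set)
    for pointer, entry in enumerate(entries):
        # First index type: access to entries via their headwords.
        headword_index[entry["headword"]].add(pointer)
        # Second index type: access via elements found inside entries.
        for attribute in content_attributes:
            for value in entry.get(attribute, ()):
                content_index[(attribute, value)].add(pointer)
    return headword_index, content_index


HEADWORD_INDEX, CONTENT_INDEX = build_indices(ENTRIES)
```

With these toy entries, the content index key ("gcode", "T1") points at the
entries for 'show' and 'display', while the headword index reaches 'show'
directly.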
Querying a Dictionary
A query consists of a hierarchical collection of attributes with
associated values; for example the query
[[syn [gcode T1]]
[sem [word show]]]
has two attributes at the top level: 'syn' and 'sem'; the attribute
'gcode' is beneath 'syn' with value 'T1', and 'word' beneath 'sem' with
value 'show'. The user can construct and modify such queries
interactively (the next section outlines the extensive facilities
provided by the system), and then ask the system to retrieve all the
entries which satisfy it. Several dictionaries can be loaded in the
same session: they are all available for access concurrently.
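To make the shape of such a query concrete, here is a small Python sketch
that reads the bracketed notation into nested lists and then flattens it into
(attribute path, value) pairs. The functions parse_query and query_leaves are
hypothetical helpers written for this illustration, not part of the LDB, and
the parser assumes well-formed input.

```python
import re


def parse_query(text):
    """Parse bracketed query text into nested Python lists of strings."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    position = 0

    def parse():
        nonlocal position
        token = tokens[position]
        position += 1
        if token != "[":
            return token  # an attribute name or an atomic value
        items = []
        while tokens[position] != "]":
            items.append(parse())
        position += 1  # skip the closing bracket
        return items

    return parse()


def query_leaves(nodes, prefix=()):
    """Yield (attribute_path, value) for each atomic value in a parsed query."""
    for name, *children in nodes:
        for child in children:
            if isinstance(child, list):
                yield from query_leaves([child], prefix + (name,))
            else:
                yield prefix + (name,), child


QUERY = parse_query("[[syn [gcode T1]] [sem [word show]]]")
LEAVES = list(query_leaves(QUERY))
```

For the example query, LEAVES holds two pairs: the path ('syn', 'gcode') with
value 'T1', and the path ('sem', 'word') with value 'show'.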
Looking a query up is a two-stage process. The LDB first maps the query
onto a collection of indices, determines which of these are the most
discriminating (i.e. have the lowest frequency, based on statistics
which were gathered during the creation of the index files), and finds
an initial set of entries which satisfy this subset of indices by
computing the intersection of the pointers from the index files to
entries in the dictionary corresponding to the indices. The LDB then
retrieves this set of entries from the source dictionary, checks which
ones satisfy the rest of the (less discriminating) indices, and returns
the ones which do as the final result. Crucial to the efficiency of
dictionary query lookup is a good partition of the indices in the query
into those used to form the initial candidate set of entries and those
used to check these entries after they have been retrieved. The LDB
bases its partition on estimates of the relative costs of reading and
intersecting entry pointers versus reading and checking the entries
themselves, on the numbers of pointers that will be read, and the
expected probability of an entry succeeding in a check against a
particular index.
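The two-stage lookup can be sketched in Python as follows. This is a
simplified illustration under stated assumptions: the entries and index are
toy in-memory data, the lookup function is hypothetical, and the partition is
a fixed split (the n_initial least frequent indices form the candidate set)
rather than the cost-based estimate the LDB is described as using.

```python
# Toy entries invented for illustration; list positions act as entry pointers.
ENTRIES = [
    {"headword": "show", "gcode": {"T1", "D1"}, "word": {"cause", "see"}},
    {"headword": "display", "gcode": {"T1"}, "word": {"show", "public"}},
    {"headword": "exhibit", "gcode": {"T1"}, "word": {"show", "publicly"}},
    {"headword": "hide", "gcode": {"T1"}, "word": {"put", "sight"}},
    {"headword": "demonstrate", "gcode": {"I0"}, "word": {"show", "clearly"}},
]

# Content index: (attribute, value) -> set of entry pointers.
INDEX = {}
for pointer, entry in enumerate(ENTRIES):
    for attribute in ("gcode", "word"):
        for value in entry[attribute]:
            INDEX.setdefault((attribute, value), set()).add(pointer)


def lookup(keys, index, entries, n_initial=1):
    """Return pointers of entries satisfying every (attribute, value) key."""
    if not keys:
        return list(range(len(entries)))
    # Most discriminating first: the fewest pointers in the index.
    ranked = sorted(keys, key=lambda key: len(index.get(key, ())))
    split = max(1, n_initial)
    initial, remaining = ranked[:split], ranked[split:]
    # Stage 1: intersect the pointer sets of the most discriminating indices.
    candidates = set.intersection(*(set(index.get(key, ())) for key in initial))
    # Stage 2: retrieve each candidate entry and check the remaining indices.
    return [
        pointer
        for pointer in sorted(candidates)
        if all(value in entries[pointer].get(attribute, ())
               for attribute, value in remaining)
    ]


RESULT = lookup([("gcode", "T1"), ("word", "show")], INDEX, ENTRIES)
```

Here ("word", "show") occurs in three entries and ("gcode", "T1") in four, so
the former supplies the candidate set and the latter is checked against the
retrieved entries, leaving 'display' and 'exhibit'.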
When looking up a query, the LDB, by default, computes the answers in a
sense-based (rather than an entry-based) fashion: that is, it returns
just the senses which satisfy the query, not the whole entry (unless of
course all the senses in the entry satisfy it). The LDB offers a number
of options for the display of answers to a query: after informing the
user of how many entries and senses satisfied the query, the LDB can be
asked to display the headwords of all the results, of a sample of them,
of the first one, or nothing. In addition, a user-defined option allows
any portion of result entries to be displayed rather than simply the
headword, and results can also be returned as entry pointers to allow
arbitrary set operations (e.g. union, intersection, set difference) to
be applied to them.
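The difference between sense-based and entry-based answers, and the use of
set operations on results, can be pictured with the Python sketch below. It
assumes that a result is a set of (headword, sense number) pointers; this
representation, the toy data and the helper names are invented for the
example and do not reflect the LDB's internal pointer format.

```python
# Toy dictionary invented for illustration: headword -> list of senses.
SENSES = {
    "show": [{"gcode": "T1"}, {"gcode": "D1"}, {"gcode": "T1"}],
    "hide": [{"gcode": "T1"}],
    "sleep": [{"gcode": "I0"}],
}


def matching_senses(dictionary, attribute, value):
    """Return (headword, sense_number) pointers for the senses that match."""
    return {
        (headword, number)
        for headword, senses in dictionary.items()
        for number, sense in enumerate(senses, start=1)
        if sense.get(attribute) == value
    }


def summarise(pointers):
    """Return (number of entries, number of senses) covered by a result set."""
    return len({headword for headword, _ in pointers}), len(pointers)


T1 = matching_senses(SENSES, "gcode", "T1")
D1 = matching_senses(SENSES, "gcode", "D1")

EITHER = T1 | D1   # union
BOTH = T1 & D1     # intersection
ONLY_T1 = T1 - D1  # set difference
```

In this toy data only senses 1 and 3 of 'show' carry T1, so the sense-based
answer omits sense 2 even though the entry as a whole is represented in the
result.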
Derived Dictionaries
As well as making several dictionaries available for access
concurrently, the LDB allows the user to apply a single query to two or
more dictionaries simultaneously, as long as all of the dictionaries
concerned are derived from a single 'source' dictionary. (In fact, one
of the dictionaries may be the source itself.) A derived dictionary
will typically consist of an elaboration of a subset of the information
in the source dictionary: for example, it might contain just the definition
part of entries in the source, together with parsed representations of
those definitions. The LDB also makes it straightforward to create derived
dictionaries based on the entries returned from the lookup of a query.
Formulating Queries
As mentioned above, a query is represented as a hierarchical collection
of attributes with associated values. The basic values for attributes
are usually atomic tokens (e.g. numbers, dictionary codes, or words) as
in the example above. More complex types of value may be made out of
these basic values, however, in order to express conjunctive and
disjunctive queries and (atomic) negation. Basic values can be
wildcarded with '?', which matches any single sub-element, and '*',
which matches any sequence of sub-elements.
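The wildcard behaviour can be sketched in Python as below. Two assumptions
are made here that the description above does not settle: the sub-elements of
a value are taken to be its characters, and '*' is allowed to match an empty
sequence. The function wildcard_match is a hypothetical helper, not the LDB's
matcher.

```python
from functools import lru_cache


def wildcard_match(pattern, value):
    """Return True if value matches pattern, where '?' stands for exactly one
    sub-element and '*' for any (possibly empty) sequence of sub-elements."""

    @lru_cache(maxsize=None)
    def match(i, j):
        if i == len(pattern):
            return j == len(value)
        if pattern[i] == "*":
            # Either '*' matches nothing more, or it absorbs one more sub-element.
            return match(i + 1, j) or (j < len(value) and match(i, j + 1))
        if j < len(value) and pattern[i] in ("?", value[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```

Under these assumptions 'sh?w' matches 'show', 'T?' matches 'T1', and '*ing'
matches 'showing', whereas 'sh?' does not match 'show' because '?' consumes
exactly one sub-element.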
In addition to the attributes defined during the mounting of a
dictionary, the LDB itself provides ones called 'headword' and 'constr'
(short for constraint). 'Headword' allows a (partial, using wildcards)
specification of the headword to be made on entries that will be
retrieved. 'Constr' provides a way of expressing queries which cannot
be formulated in a simple attribute-value form. The attribute can have
one or more values, which are interpreted conjunctively; each value is
either a disjunction of a set of dictionary index specifications or calls
to Lisp, or the negation of one.
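One way to picture the logical shape of a 'constr' value is the Python sketch
below, in which a constraint is a conjunction of clauses and each clause is a
disjunction or a negation. Everything here is an assumption made for
illustration: the helper names are invented, index specifications are modelled
as simple membership tests, and an ordinary Python function stands in for a
call to Lisp.

```python
def has(attribute, value):
    """Stand-in for a dictionary index specification."""
    return lambda entry: value in entry.get(attribute, ())


def any_of(*predicates):
    """Disjunction of index specifications or function calls."""
    return lambda entry: any(predicate(entry) for predicate in predicates)


def negation(predicate):
    """Negation of a single index specification or function call."""
    return lambda entry: not predicate(entry)


def satisfies(entry, constraints):
    """The values of the constraint are taken conjunctively."""
    return all(constraint(entry) for constraint in constraints)


# Toy entry invented for illustration.
ENTRY = {"headword": "show", "gcode": {"T1", "D1"}, "word": {"cause", "see"}}

CONSTRAINTS = [
    any_of(has("gcode", "T1"), has("gcode", "I0")),
    negation(has("word", "hide")),
    # An arbitrary function plays the role of a call to Lisp.
    any_of(lambda entry: len(entry["headword"]) <= 4),
]

ACCEPTED = satisfies(ENTRY, CONSTRAINTS)
```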
The LDB gives the user the option of using either a TTY or a graphical
interface (the latter only when running on top of Procyon CL,
Franz Allegro CL with Common Windows, or Franz Allegro CL\PC)
for constructing queries, modifying them, looking them up, reading them
from file, and saving them back to disc. Both interfaces provide full
facilities for the quick and accurate manipulation of queries. The
graphical interface makes extensive use of the mouse, windows, and
pop-up and pull-down menus.
In the Longman Dictionary of Contemporary English (LDOCE),
attribute names form a hierarchy, with 'syn' (syntax), 'sem'
(semantics) and 'pron' (pronunciation) at the top. The LDOCE attribute
name hierarchy is as follows:
- syn
  - cat: syntactic category
  - c1: first compound field
  - c2: second compound field
  - gcode: grammar code (encoding complementation etc.)
  - label: label field
  - multiple: number of words in headword field, and whether hyphenated
- sem
  - antonym: from the definition field
  - box: box codes (encoding selectional restrictions and other information)
  - defn: 'implicit' x-refs and 'defining word' stems in definition
  - order: order of words in definition
  - subj: subject codes
  - synonym: taken from the definition field
  - word: actual words (excluding cross-references) in definition
  - xref: 'explicit' x-refs in definition
- pron
  - nsylls: number of syllables
  - s1, s2, s3, etc., where si is the ith syllable
    - stress: stress for this syllable
    - onset: phonemes at the syllable onset
    - peak: phonemes at the syllable peak
    - coda: phonemes at the syllable coda
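The hierarchy above can be written down as a nested mapping, which makes it
easy to check whether an attribute path used in a query names a node in the
hierarchy. The Python sketch below is illustrative only: the data structure
and the is_valid_path helper are assumptions, and the syllable attributes s1,
s2, s3, etc. are handled by a pattern because their number is open-ended.

```python
import re

SYLLABLE_NAME = re.compile(r"s[1-9][0-9]*")
SYLLABLE_ATTRIBUTES = {"stress": {}, "onset": {}, "peak": {}, "coda": {}}

PRON = {"nsylls": {}}  # s1, s2, ... are matched by SYLLABLE_NAME instead

LDOCE_ATTRIBUTES = {
    "syn": {name: {} for name in
            ("cat", "c1", "c2", "gcode", "label", "multiple")},
    "sem": {name: {} for name in
            ("antonym", "box", "defn", "order", "subj",
             "synonym", "word", "xref")},
    "pron": PRON,
}


def is_valid_path(path, hierarchy=LDOCE_ATTRIBUTES):
    """Return True if the sequence of attribute names follows the hierarchy."""
    level = hierarchy
    for name in path:
        if level is PRON and SYLLABLE_NAME.fullmatch(name):
            level = SYLLABLE_ATTRIBUTES
        elif name in level:
            level = level[name]
        else:
            return False
    return True
```

For example, the path ('syn', 'gcode') and the path ('pron', 's2', 'onset')
are accepted, while ('syn', 'onset') is rejected because 'onset' belongs
under a syllable attribute.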