Department of Computer Science and Technology

Technical reports

Dictionary characteristics in cross-language information retrieval

Donnla Nic Gearailt

February 2005, 158 pages

This technical report is based on a dissertation submitted February 2003 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Gonville and Caius College.

DOI: 10.48456/tr-616


In the absence of resources such as a suitable machine translation (MT) system, translation in Cross-Language Information Retrieval (CLIR) consists primarily of mapping query terms to a semantically equivalent representation in the target language. This can be accomplished by looking up each term in a simple bilingual dictionary. The main problem here is deciding which of the translations provided by the dictionary for each query term should be included in the query translation. We tackled this problem by examining different characteristics of the system dictionary. We found that dictionary properties such as scale (the average number of translations per term), translation repetition (providing the same translation for a term more than once in a dictionary entry, for example, for different senses of a term), and dictionary coverage rate (the percentage of query terms for which the dictionary provides a translation) can have a profound effect on retrieval performance. Dictionary properties were explored in a series of carefully controlled tests, designed to evaluate specific hypotheses. These experiments showed that (a) contrary to expectation, smaller-scale dictionaries resulted in better performance than large-scale ones, and (b) when appropriately managed, e.g. through strategies to ensure adequate translational coverage, dictionary-based CLIR could perform as well as other CLIR methods discussed in the literature. Our experiments showed that it is possible to implement an effective CLIR system with no resources other than the system dictionary itself, removing any dependency on outside resources, provided this dictionary is chosen after careful examination of its characteristics.
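The three dictionary properties named above can be made concrete with a small sketch. This is an illustration only, not the method used in the report: the toy English-French entries, the sense-list data layout, and the function names (`scale`, `repeated_translations`, `coverage_rate`) are all assumptions, as is the choice to count distinct translations when computing scale.

```python
from collections import Counter

# Hypothetical toy dictionary (invented for illustration): each source term
# maps to a list of senses, and each sense lists its translations.
toy_dictionary = {
    "bank": [["banque"], ["rive", "banque"]],
    "river": [["rivière", "fleuve"]],
    "loan": [["prêt"]],
}


def flatten(entry):
    """Return every translation listed in an entry, across all senses."""
    return [translation for sense in entry for translation in sense]


def scale(dictionary):
    """Average number of translations per term.

    Assumption: repeated translations within an entry are counted once.
    """
    if not dictionary:
        return 0.0
    total = sum(len(set(flatten(entry))) for entry in dictionary.values())
    return total / len(dictionary)


def repeated_translations(entry):
    """Translations that appear more than once in a single entry."""
    counts = Counter(flatten(entry))
    return {t: n for t, n in counts.items() if n > 1}


def coverage_rate(query_terms, dictionary):
    """Percentage of query terms with at least one translation."""
    if not query_terms:
        return 0.0
    covered = sum(1 for term in query_terms if flatten(dictionary.get(term, [])))
    return 100.0 * covered / len(query_terms)


if __name__ == "__main__":
    print(scale(toy_dictionary))
    print(repeated_translations(toy_dictionary["bank"]))
    print(coverage_rate(["bank", "loan", "interest", "river"], toy_dictionary))
```

In this toy data, "banque" is listed under two senses of "bank", which is the kind of translation repetition the abstract describes, and the query term "interest" has no entry, which lowers the coverage rate.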

