Department of Computer Science and Technology

Technical reports

Proximity visualisation of abstract data

Wojciech Basalaj

January 2001, 117 pages

This technical report is based on a dissertation submitted October 2000 by the author for the degree of Doctor of Philosophy to the University of Cambridge.

DOI: 10.48456/tr-509

Abstract

Data visualisation is an established technique for exploration, analysis and presentation of data. A graphical presentation is generated from the data content, and viewed by an observer, engaging vision – the human sense with the greatest bandwidth, and the ability to recognise patterns subconciously. For instance, a correlation present between two variables can be elucidated with a scatter plot. An effective visualisation can be difficult to achieve for an abstract collection of objects, e.g. a database table with many attributes, or a set of multimedia documents, since there is no immediately obvious way of arranging the objects based on their content. Thankfully, similarity between pairs of elements of such a collection can be measured, and a good overview picture should respect this proximity information, by positioning similar elements close to one another, and far from dissimilar objects. The resulting proximity visualisation is a topology preserving map of the underlying data collection, and this work investigates various methods for generating such maps. A number of algorithms are devised, evaluated quantitatively by means of statistical inference, and qualitatively in a case study for each type of data collection. Other graphical representations for abstract data are surveyed and compared to proximity visualisation.

A standard method for modelling prximity relations is multidimensional scaling (MDS) analysis. The result is usually a two- or three-dimensional configuration of points – each representing a single element from a collection., with inter-point distances approximating the corresponding proximities. The quality of this approximation can be expressed as a loss function, and the optimal arrangement can be found by minimising it numerically – a procedure known as least-squares metric MDS. This work presents a number of algorithmic instances of this problem, using established function optimisation heuristics: Newton-Raphson, Tabu Search, Genetic Algorithm, Iterative Majorization, and Stimulated annealing. Their effectiveness at minimising the loss function is measured for a representative sample of data collections, and the relative ranking established. The popular classical scaling method serves as a benchmark for this study.

The computational cost of conventional MDS makes it unsuitable for visualising a large data collection. Incremental multidimensional scaling solves this problem by considering only a carefully chosen subset of all pairwise proximities. Elements that make up cluster diameters at a certain level of the single link cluster hierarchy are identified, and are subject to standard MDS, in order to establish the overall shape of the configuration. The remaining elements are positioned independently of one another with respect to this skeleton configuration. For very large collections the skeleton configuration can itself be built up incrementally. The incremental method is analysed for the compromise between solution quality and the proportion of proximities used, and compared to Principal Components Analysis on a number of large database tables.

In some applications it is convenient to represent individual objects by compact icons of fixed size, for example the use of thumbnails when visualising a set of images. Because the MDS analysis only takes the position of icons into account, and not their size, its direct use for visualisation may lead to partial or complete overlap of icons. Proximity grid – an analogue of MDS in a discrete domain – is proposed to overcome this deficiency. Each element of an abstract data collection is represented within a single cell of the grid, and thus considerable detail can be shown without overlap. The proximity relationships are preserved by clustering similar elements in the grid, and keeping dissimilar ones apart. Algorithms for generating such an arrangement are presented and compared in terms of output quality to one another as well as standard MDS.

Full text

PDF (7.7 MB)

BibTeX record

@TechReport{UCAM-CL-TR-509,
  author =	 {Basalaj, Wojciech},
  title = 	 {{Proximity visualisation of abstract data}},
  year = 	 2001,
  month = 	 jan,
  url = 	 {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-509.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-509},
  number = 	 {UCAM-CL-TR-509}
}