National Library of Wales, Peniarth 392D, f. 13
By permission of the National Library of Wales
John Daugman and
Linne Mooney (LRM3 at york.ac.uk)
BACKGROUND OF PROJECT
Five hundred years ago in Europe a major transition in the written dissemination of ideas took place, brought about by the invention of printing (and in England from its introduction by William Caxton in 1476). Historians of the book ("codicologists") are seeking to recover the trends in the dissemination of texts immediately preceding and after this revolution by identifying the scribes who copied medieval manuscripts in England at the end of the manuscript era (1375-1525). They are interested in questions such as: what texts did the scribes copy most; at what rate did they copy them, or how many manuscripts can be attributed to a single hand; is there evidence of a movement toward mass production and commercial sale of manuscripts immediately before the introduction of print in 1476?
The identification of same-scribe specimens of medieval English handwriting currently depends upon the expert eyes of specialists in dating handwriting ("palaeographers") because English scribes rarely signed their work. But thousands of manuscripts survive from the fifteenth century and there is as yet no known computational method for classifying or categorizing them by handwriting which could assist those trying to identify new specimens. Professor Linne Mooney, an expert codicologist-palaeographer based at York University would like to enlist us to devise algorithms for classifying manuscripts by scribe on the basis of their handwriting.
DETAILS OF PROJECT
Several samples of scribal handwriting by each of 10 scribes will be provided on a CD to students selecting this project, in JPEG images such as seen here, classified already by scribal identity. Transcriptions of these samples will also be provided as text files so that the "ground truth" about each character will be available. Students adopting this project will devise feature extraction algorithms, learning algorithms, and statistical classification algorithms, that segregate these 10 scribes as well as possible in accordance with their known identities. In Easter Term a further set of test samples of scribal handwriting will be provided: some by the original 10 scribes and some by others, all without their identities given. Students will be expected to test and to document the performance of their algorithms against this blind test set, and to analyse their algorithms' results, in their dissertations.
The heart of this project is finding features and creating mathematical representations of them that optimally separate the different scribal classes. Almost the entire science of pattern recognition is spawned by the question of a single relationship: the comparative magnitudes of within-class variability and between-class variability. Patterns can be reliably recognised, or classified, only if the representations of them express between-class variability that is larger than their within-class variability. (In terms of cluster analysis, this means that the spacings between clusters must be larger than the diameters of the clusters.) In the present context, this principle requires that the variability amongst the hands of different scribes, as represented in the computed set of features, should be greater than the variability in the representation of the hand of any given scribe over time or across manuscripts. How can this be accomplished?
Approaches to handwriting classification may be divided broadly into extraction
of "global" versus "local" features. Globally, specimens of handwriting by
medieval scribes differ in terms of the angle of slant, the size of writing, height
of lower-case letters in relation to ruled lines (how much white space between
lines), height of ascenders (of letters 'b', 'f', 'h', 'l', and medieval long 's')
or depth of descenders (of letters like 'g', 'p', 'q', and 'y') in relation to the
normal height of lower-case letters or ruled lines. For example, consider the
following sample by a scribe who may or may not be the same as the scribe who
wrote the sample above:
Trinity College, Cambridge, B.15.17, f. 125v
By permission of the Master and Fellows of Trinity College, Cambridge
Local analysis involves
individual letter forms, of which a scribe might have more than one in his writing
repertoire (one for more formal writing, one for more casual writing, for instance).
Palaeographers have identified a few letter-forms that vary most in the handwriting
of medieval English scribes and that might therefore be best for determining
between-class variability, and hence may be the best determiners for isolating
same-scribe specimens of handwriting: 'a', 'd', 'g', 'h', 'p', 'r', 's', 'w', and
medieval English representations of 'th' and ampersand. Here are some examples
of the letter 'g' by the same scribe:
(Peniarth, op cit.)
But you may find that other letter-forms are more distinctive from a computational point of view, or that only a few of these will suffice for finding distinctions and making classifications. Whatever types of discriminating features are selected, distance metrics are then constructed that gauge the similarity or dissimilarity of the script samples and plot their relationships in a feature space. It is often useful to represent the data not in terms of raw features but in terms of more abstract descriptions that are computed by mathematical operations such as Principal Components Analysis, or the Karhunen-Loeve Transform. Probability theory and statistical models then drive the classification process through a variety of possible data fusion, decision, and inference techniques.