Suggested student project: Identification of Medieval Scribal Handwriting

Suggested student project:
Identification of Medieval Scribal Handwriting

(loading 25K)
National Library of Wales, Peniarth 392D, f. 13
By permission of the National Library of Wales

PROJECT ORIGINATORS: John Daugman and Linne Mooney (LRM3 at york.ac.uk)

BACKGROUND OF PROJECT

Five hundred years ago in Europe a major transition in the written dissemination of ideas took place, brought about by the invention of printing (and in England from its introduction by William Caxton in 1476). Historians of the book ("codicologists") are seeking to recover the trends in the dissemination of texts immediately preceding and after this revolution by identifying the scribes who copied medieval manuscripts in England at the end of the manuscript era (1375-1525). They are interested in questions such as: what texts did the scribes copy most; at what rate did they copy them, or how many manuscripts can be attributed to a single hand; is there evidence of a movement toward mass production and commercial sale of manuscripts immediately before the introduction of print in 1476?

The identification of same-scribe specimens of medieval English handwriting currently depends upon the expert eyes of specialists in dating handwriting ("palaeographers") because English scribes rarely signed their work. But thousands of manuscripts survive from the fifteenth century and there is as yet no known computational method for classifying or categorizing them by handwriting which could assist those trying to identify new specimens. Professor Linne Mooney, an expert codicologist-palaeographer based at York University would like to enlist us to devise algorithms for classifying manuscripts by scribe on the basis of their handwriting.

DETAILS OF PROJECT

Several samples of scribal handwriting by each of 10 scribes will be provided on a CD to students selecting this project, in JPEG images such as seen here, classified already by scribal identity. Transcriptions of these samples will also be provided as text files so that the "ground truth" about each character will be available. Students adopting this project will devise feature extraction algorithms, learning algorithms, and statistical classification algorithms, that segregate these 10 scribes as well as possible in accordance with their known identities. In Easter Term a further set of test samples of scribal handwriting will be provided: some by the original 10 scribes and some by others, all without their identities given. Students will be expected to test and to document the performance of their algorithms against this blind test set, and to analyse their algorithms' results, in their dissertations.

The heart of this project is finding features and creating mathematical representations of them that optimally separate the different scribal classes. Almost the entire science of pattern recognition is spawned by the question of a single relationship: the comparative magnitudes of within-class variability and between-class variability. Patterns can be reliably recognised, or classified, only if the representations of them express between-class variability that is larger than their within-class variability. (In terms of cluster analysis, this means that the spacings between clusters must be larger than the diameters of the clusters.) In the present context, this principle requires that the variability amongst the hands of different scribes, as represented in the computed set of features, should be greater than the variability in the representation of the hand of any given scribe over time or across manuscripts. How can this be accomplished?

Approaches to handwriting classification may be divided broadly into extraction of "global" versus "local" features. Globally, specimens of handwriting by medieval scribes differ in terms of the angle of slant, the size of writing, height of lower-case letters in relation to ruled lines (how much white space between lines), height of ascenders (of letters 'b', 'f', 'h', 'l', and medieval long 's') or depth of descenders (of letters like 'g', 'p', 'q', and 'y') in relation to the normal height of lower-case letters or ruled lines. For example, consider the following sample by a scribe who may or may not be the same as the scribe who wrote the sample above:
(loading 25K)
Trinity College, Cambridge, B.15.17, f. 125v
By permission of the Master and Fellows of Trinity College, Cambridge

Local analysis involves individual letter forms, of which a scribe might have more than one in his writing repertoire (one for more formal writing, one for more casual writing, for instance). Palaeographers have identified a few letter-forms that vary most in the handwriting of medieval English scribes and that might therefore be best for determining between-class variability, and hence may be the best determiners for isolating same-scribe specimens of handwriting: 'a', 'd', 'g', 'h', 'p', 'r', 's', 'w', and medieval English representations of 'th' and ampersand. Here are some examples of the letter 'g' by the same scribe:
(Peniarth, op cit.)
But you may find that other letter-forms are more distinctive from a computational point of view, or that only a few of these will suffice for finding distinctions and making classifications. Whatever types of discriminating features are selected, distance metrics are then constructed that gauge the similarity or dissimilarity of the script samples and plot their relationships in a feature space. It is often useful to represent the data not in terms of raw features but in terms of more abstract descriptions that are computed by mathematical operations such as Principal Components Analysis, or the Karhunen-Loeve Transform. Probability theory and statistical models then drive the classification process through a variety of possible data fusion, decision, and inference techniques.

REFERENCES

Duda R O, Hart P E, and Stork D G (2001) Pattern Classification, 2nd ed. (New York: Wiley).
Marius Bulacu, Lambert Schomaker (2003) "Writer style from oriented edge fragments," Proc. of 10th Int. Conf. on Computer Analysis of Images and Patterns (CAIP 2003): LNCS 2756, Springer, pp. 460-469, 25-27 August, Groningen, The Netherlands. (available in .pdf form here.)
Lambert Schomaker, Marius Bulacu, Merijn van Erp (2003) "Sparse-parametric writer identification using heterogeneous feature groups", Proc. of Int. Conf. on Image Processing (ICIP 2003), IEEE Press, pp. 545-548, vol. I, 14-17 September, Barcelona, Spain. (available in .pdf form here.)
Marius Bulacu, Lambert Schomaker, Louis Vuurpijl (2003) "Writer identification using edge-based directional features," Proc. of 7th Int. Conf. on Document Analysis and Recognition (ICDAR 2003), IEEE Press, pp. 937-941, vol. II, 3-6 August, Edinburgh, Scotland. (available in .pdf form here.)
(related work by Prof. Lambert Schomaker, Dr Marius Bulacu, and others in their group at the University of Groningen is accessible via these links for Lambert) and Bulacu.)
Zhu Y, Tan T, and Wang Y (2001) "Font recognition based on global texture analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, 23(10), pp. 1192-1200.
Tan, T. N. (1998) "Rotation invariant texture features and their use in automatic script identification," IEEE Trans. Pattern Analysis and Machine Intelligence, 20(7), pp. 751-756.
Website on handwriting recognition by Srihari et al. at SUNY-Buffalo: http://www.cedar.buffalo.edu/papers/publications.html
A related project idea: http://www.cl.cam.ac.uk/~mgk25/only-cam/project-ideas-2003.html#genizah
Arazi B (1977) "Handwriting identification by means of run-length measurements," IEEE Trans. Syst., Man and Cybernetics, SMC-7(12), pp. 878-881.
Maarse F, Schomaker L, and Teulings H-L (1988) "Automatic identification of writers." In Human-Computer Interaction: Psychonomic Aspects pp. 353-360 (New York: Springer).
Saks and Risenger (1996) "Science and non-science in the courts: Daubert meets handwriting identification expertise," Iowa Law Review, available here
Late Medieval English Scribes