Computer Laboratory

Technical reports

Automatic recognition of words in Arabic manuscripts

Mohammad S.M. Khorsheed

July 2000, 242 pages

This technical report is based on a dissertation submitted June 2000 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Churchill College.

Abstract

The need to transliterate large numbers of historic Arabic documents into machine-readable form has motivated new work on offline recognition of Arabic script. Arabic script presents two challenges: orthography is cursive and letter shape is context sensitive.

This dissertation presents two techniques to achieve high word recognition rates: the segmentation-free technique and the segmentation-based technique. The segmentation-free technique treats the word as a whole. The word image is first transformed into a normalised polar image. The two-dimensional Fourier transform is then applied to the polar image. This results in a Fourier spectrum that is invariant to dilation, translation, and rotation. The Fourier spectrum is used to form the word template, or train the word model in the template-based and the multiple hidden Markov model (HMM) recognition systems, respectively. The recognition of an input word image is based on the minimum distance measure from the word templates and the maximum likelihood probability for the word models.

The segmentation-based technique uses a single hidden Markov model, which is composed of multiple character-models. The technique implements the analytic approach in which words are segmented into smaller units, not necessarily characters. The word skeleton is decomposed into a number of links in orthographic order, it is then transferred into a sequence of discrete symbols using vector quantisation. the training of each character-model is performed using either: state assignment in the lexicon-driven configuration or the Baum-Welch method in the lexicon-free configuration. The observation sequence of the input word is given to the hidden Markov model and the Viterbi algorithm is applied to provide an ordered list of the candidate recognitions.

Full text

PDF (2.4 MB)

BibTeX record

@TechReport{UCAM-CL-TR-495,
  author =	 {Khorsheed, Mohammad S.M.},
  title = 	 {{Automatic recognition of words in Arabic manuscripts}},
  year = 	 2000,
  month = 	 jul,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-495.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-495}
}