§ 1 The handling of corrupted manuscripts poses new challenges to image acquisition and data editing. In the case of the project "Critical Edition of the New Sinaitic Glagolitic Euchology (Sacramentary) Fragments with the Aid of Modern Technologies" the data material consists of two parchment codices of the Old Church Slavonic canon dating from the eleventh century: the so-called Missale Sinaiticum (Sin. slav. 5/N) and the new part of the Euchologium Sinaiticum (Sin. slav. 1/N). Both fragments belong to the complex of new findings from St. Catherine’s Monastery on Mount Sinai in 1975. They show extensive damage including faded or blurred ink, staining due to mould or humidity, degradation of the parchment (e.g. chipping, fragmentation and contortion of folia), and the rare phenomenon of chemical conversion of black into white ink. The manuscripts also partly contain palimpsest (re-written) folia.
§ 2 For the decipherment and reconstruction of the texts mere philological means turned out to be insufficient (Miklas 2004), so in 2007 an interdisciplinary project of philologists (University of Vienna), computer scientists (image processing group PRIP, Vienna University of Technology) and material chemists (Vienna Academy of Fine Arts) was set up. The project has two major objectives: Research in technologies for the analysis and reconstruction of damaged manuscripts and the preparation of critical editions; the latter aim includes a paper edition with high quality facsimiles and data on the makeup of the manuscripts, as well as an electronic edition in the Manuscript database (Baranov forthcoming). The interdisciplinary cooperation of philologists and image processing experts provided the unique opportunity to develop computer aided routines assisting the philological research of manuscripts.
§ 3 The basis for the application of text reconstruction techniques is Multi-Spectral Imaging (MSI), which has proven to be a capable technique for the analysis and preservation of ancient documents (Easton, Knox, et al. 2003). Originally utilized in remote sensing applications such as earth observation and region classification (Richards, Xiuping 2005), at the beginning of this millennium researchers started applying this approach also to the examination of historical documents. Especially images of the UltraViolet (UV) and InfraRed (IR) light ranges reveal additional information hidden to the human eye (Mairinger 2003). Applied to ancient manuscripts, Easton et al. (2003) were the first to capture and enhance the erased underscript of the famous Archimedes palimpsest with MSI methods. It became apparent that the application of spectral imaging can improve readability, especially of damaged manuscripts (Rapantzikos, Balas 2005), more than conventional colour imaging procedures.
§ 4 On an expedition to Mount Sinai in 2007 we acquired a corpus of MSI data from approximately 150 folia from the relevant manuscripts. Nine spectral images were taken of each folio with a radiometric resolution of 12 bits and a spatial resolution of 565 dpi. The MSI acquisition system will be described in "Multispectral Image Acquisition," below.
§ 5 However, high quality raw digital images alone are not sufficient for the decipherment of damaged manuscripts. In order to exploit the full range of information in the MSI data, further procedures are necessary. Particularly a robust foreground-background separation is essential for further image enhancement and both automated and manual investigations in codicology, palaeography and text reconstruction. To separate text from background we combine the spectral signature of the MSI data with the spatial characteristics of characters or strokes by incorporating a Markov Random Field (MRF) model ("Foreground-Background Separation").
§ 6 In "Further Project Applications" we will also discuss some further applications that have been developed in the course of the project: (1) For character decomposition and feature extraction ("Character Decomposition and Primitive Extraction") the quality of the underlying binary input image is essential to achieve positive results. Here, the individual characters are thinned to a skeleton, dissected into their segments (primitives) and analysed according to criteria of linguistic graphetics (script description). For further interpretation the results can be transferred to a database ("Character Database"). As the character features can also be analysed and inserted in the database manually, we are now able to compare and evaluate manual and computational script description. (2) The tool for layout analysis ("Layout Analysis") supports codicological investigations such as ruling detection and script alignment (hanging or standing), the positioning of fragments, and the reconstruction of latent text material.
§ 7 All applications are tailored especially for damaged historical manuscripts and prepare the ground for subsequent examinations, philological (e.g. layout interpretation, the creation of sample alphabets, script and scribe analyses) as well as technical procedures (e.g. optical character recognition and scribe identification).
Multispectral Image Acquisition
§ 8 MSI applied in the spectral ranges from UV via VISible (VIS) up to the Near InfraRed (NIR) light range combines conventional imaging and spectrometry information of an object. The acquisition setup used for the Old Church Slavonic Sinaitic Euchologies consists of a digital colour camera and a scientific high resolution camera. Colour images and UV fluorescence images are captured with a Nikon D2Xs RGB camera providing images principally for visualisation purposes and the facsimile publications. For the multi-spectral images we use a Hamamatsu C9300-124 camera with a spectral sensitivity from UV to NIR (330nm-1000nm) and a resolution of 40002672 pixels. A filter wheel mounted in front of the camera selects different spectral images. Figure 1 shows the alignment of the two cameras.
§ 9 The setup yields a spatial resolution of 565 dpi for the multi-spectral images and a resolution of approximately 500 dpi for the conventional colour camera. Since every page is captured with both cameras, it is necessary to shift the folia from one camera to the other. To obtain multi-spectral images we use four band-pass (BP) filters with a peak of 450 nm (blue), 550 nm (green), 650 nm (red), 780 nm (NIR), two low-pass (LP) filters with a cut-off frequency of 400 nm (UV fluorescence) and 800 nm (IR reflectography), and a high-pass (HP) filter with a cut-off frequency of 400 nm to capture UV reflectography images. Using UV illumination for the UV reflectography and fluorescence, and VIS-NIR illumination for the other images we obtain nine different spectral images. The spectral transmittance of the filters is visualised in the electromagnetic spectrum in Figure 2.
|H1||HP 400||UV reflectography|
|H7||LP 800||IR reflectography|
§ 11 Due to the use of optical filters the images must be registered on one another before further processing (Brauers, Schulte, et al. 2008). The registration process used is described in Diem (2008). By the combination of several spectral bands (e.g. by principal component analysis) the readability could be considerably improved compared to the conventional RGB images (Miklas, Gau, et al. 2008).
§ 12 Since the beginning of computational historical document analysis researchers have been developing special methods for the analysis of degraded documents (Gatos, Pratikakis, et al. 2004). Contrary to previous studies which aim particularly at the enhancement of underwritten texts in palimpsest-manuscripts, we focus on the separation of text from background. For generating binary images from such ancient manuscripts, faded ink, mould stains, and the varying appearance of the text on different pages are the most challenging problems. Since independent component analysis methods are based on the assumption of the mutual independence of the sources (Tonazzini, Bedini, et. al. 2004), these methods are not applicable for our purpose.
Related Work and Overview
§ 13 A large number of publications, competitions (Gatos, Ntirogiannis, et al. 2009) and new applications are devoted to the subject of converting document images into binary images and separating text from background. Overview papers comparing different methods are given for instance in Gupta et al. (2007) or Leedham et al. (2003). In general, binarisation algorithms for document image analysis can be divided into three main classes (Garain, Paquet, et al. 2006):
- global thresholding,
- adaptive thresholding, or
- clustering algorithms.
§ 14 Global and adaptive thresholding techniques use a threshold T to distinguish between foreground and background and are therefore executed primarily on grey level images I(i,j). The resulting black and white binary image BW(i,j) is defined as:
§ 15 Clustering algorithms like k-means use the grey level or colour to classify similar observations into foreground or background (Leydier, et al. 2004). Generating binary images from colour or multispectral images is still not prevalent (Garain, et al. 2006). Alternatively, in the simplest way, colour images are converted into grey level images and then converted into binary images by thresholding.
§ 16 In the case of old and degraded documents, local methods outperform global ones. For instance, Sauvola and Pietikäinen (2000) developed an adaptive document image binarisation by determining a threshold for each pixel. Gatos et al. (2006) developed an adaptive degraded image binarisation algorithm following several distinctive steps: Wiener filtering, a rough estimation of the foreground region, the calculation of the background model, and finally post-processing methods like shrink and swell filtering to improve the quality. Recently Moghaddam and Cheriet (2009) proposed a method for the restoration of single-sided low-quality document images based on a multi-level classifier. This approach is designed for colour images of historical documents and is based on connected component labelling to capture spatially linked pixels of similar colour. Wolf and Doerman (2002) use MRF models for the binarisation of low quality text. Their model for the spatial relationship is defined on a neighbourhood of 4x4 pixel cliques. Cao and Govindarayu (2009) use MRFs for the binarisation of degraded handwritten forms where the spatial relations are obtained from a training set of high quality binarised images and consist of 114 representatives of shared patches.
§ 17 In contrast to the studies cited above we have developed an algorithm for foreground-background segmentation operating on multispectral images. Therefore we do not use threshold or clustering, but combine spectral and spatial features.
MRF Based Foreground-Background Separation of MSI Data
§ 18 Our main idea for foreground-background separation is to combine spatial and spectral features. Therefore we utilise an MRF model which provides a probability theory for analysing spatial or contextual dependencies (Li 2009). Emphasis on the combination of spatial and spectral components is particularly important in the case of ancient documents with varying textures (e.g. of the hair and flesh sides of parchment). Therefore, we include the visual nature of characters by means of stroke features. The only parameter we need is the mean stroke width, which can also be evaluated automatically (Pervouchine, Leedham, et al. 2005).
§ 19 The MRF model (Li 2009) is commonly represented by an energy function E(x) on a regular lattice S where Ψi(xi) are the data costs, i.e., the costs assigning the label xi to pixel i∈S . The second term Ψi,j(xi,xj) is the cost of assigning labels xi and xj (belonging to a clique C) to two neighbouring pixels, where ß is a weighting parameter controlling the prior. The labelling x with minimum energy E(x) is the optimal solution and can be found via energy minimization methods like belief propagation (Freeman, Pasztor, et al. 2000) or Iterated Conditional Modes (ICM) (Besag 1986).
§ 20 The prior model Ψi,j(xi,xj) reflects the fact that the segmentation of image regions is locally homogeneous. For the domain of character segmentation we propose to use higher order MRFs with cliques covering the stroke properties of the characters.
§ 21 For a regular lattice S the neighbour set N of i is defined as the set of nearby sites within a radius r. A first order MRF involves directly connected pixels in both the horizontal and vertical direction, see Figure 3a. The numbers n = 1 ... 5 in the figure indicate the neighbouring sites in an nth order neighbourhood system. As observed by Roth and Black (2005) MRF priors with small neighbourhood systems of only first order limit the expressiveness of the models. Thus C is the set of all pixels s within a radius r, where r corresponds to the radius of the mean stroke width. Our experiments for manually segmented characters allocate a mean stroke width of 5 pixels. Thus the prior considers a neighbourhood set of at least third or fourth order. To illustrate the neighbourhood system on our images consider Figure 3b. The figure shows one character from our data set with a white circle corresponding to a neighbourhood system of the fourth order.
§ 22 From the information of the multi-spectral image the observation model Ψi(xi) is extracted. The observation model or image process follows a normal distribution N(µ,Σ) (Kato, Pong 2006). Each class (foreground/text and background) is represented by its mean vector µ and covariance matrix Σ. The entities are modelled by a Gaussian mixture model.
§ 23 Energy minimisation in finding optimal solutions can be done either by local methods, like ICM, or global methods like simulated annealing (Li 2009). We used ICM to solve this energy which is a good trade-off between quality and computing time (Kato 2006). ICM uses a deterministic strategy to find local minima. It starts with estimation and then selects a label for each pixel that gives the largest decrease of the energy function. The process is repeated until convergence.
Experiments and Results
§ 24 Our segmentation method has been tested on a varied set of ancient documents. The test set consists of four folia for which ground truth data was generated manually by specialists of our philological team. An example can be seen in Figure 4a showing a single line from folio 29 recto from a single band with high contrast (BP 450). As stated in "Multispectral Image Acquisition," above, our MSI acquired nine spectral images with a spatial resolution of 565 dpi. The proposed MRF based binarisation approach is compared to state-of-the-art algorithms developed especially for historical, low contrast or noisy document images. The first method is an adaptation of the k-means algorithm proposed by Leydier et al. (2004), but in contrast to the original work we use the MSI data instead of a colour image as input, resulting in a nine dimensional feature space. Furthermore, we compare the MRF based approach to the Sauvola binarisation method (Sauvola, et al. 2000). Since thresholding algorithms perform on single gray level images, the method of Sauvola and Pietikannen (2006) is performed on a single band image with the best contrast (BP 450).
§ 25 To evaluate the results and to rank the performance of the different methods, we use the statistical measures of precision and recall (Gatos, Ntirogiannis, et al. 2009): where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. A pixel is classified as true positive if it is identified correctly. It is classified as false positive if it is wrongly classified as a match, i.e., pixels which are detected as text, although they belong to the background. Finally, false negative describes a foreground pixel that was wrongly classified as belonging to the background.
§ 26 The first experiment aimed at analysing the behaviour of the neighbourhood system N and the weighting parameter ß. Table 2 shows the precision and recall rate for ß = 0.1 and ß = 0.3 and a neighbourhood system n = 1…5. It can be seen that the recall values are very low for n < 3 which results from the background noise, i.e., noise is detected as text when the neighbourhood considered is too small. The best solution is obtained with n = 4 and ß = 0.1 which is in concordance to the proposed stroke characteristics.
|n = 1||n = 2||n = 3||n = 4||n = 5|
|n = 1||n = 2||n = 3||n = 4||n = 5|
§ 27 Generally it can be said that the smaller the considered neighbourhood system, the more noise emerges in the background. On the other side, a neighbourhood set considering too many pixels leads to missing characters or to closed character gaps or holes. The same can be said for the influence of ß.
§ 29 Especially for folio 17 recto and 30 verso the folia segmented with the k-means algorithm show only suboptimal results due to low contrast. The thresholding method shows a better recall rate of average 0.67, but has lower precision (0.65). The proposed MRF method has the best performance with an average precision rate of 0.89 and an average recall rate of 0.73. A description of the results is given for folio 29 recto. Figure 4 shows the results from folio 29 recto after the segmentation with the Sauvola binarisation method (4b), k-means clustering (4c), and the MRF approach (4d) with parameters ß = 0.1 and a neighbourhood of fourth order.
§ 30 It can be seen that especially the thresholding image contains noise in the background and even within the characters. The resulting binary image from the k-means method fails to identify the rightmost character, and others are broken. The result of the MRF approach segments even the rightmost character which has very low contrast. Moreover, the characters do not show missing or broken parts like the results from the k-means algorithm. Conspicuously, the recall rate is sometimes very high; for example, in folio 30 verso, when segmented with k-means. But it can be seen that the precision is very low at the same time, which denotes that the letters are detected rather as blobs than as characters. Thus the segmented characters cannot be identified any more.
Further Project Applications
§ 31 A robust binarisation algorithm forms the basis for a number of further analyses. In an interdisciplinary approach we built a toolbox for manuscript analysis that combines several computer applications supporting traditional codicological and palaeographic investigations. Some of them shall be presented in the following.
Character Decomposition and Primitive Extraction
§ 32 After the binarisation described in "Foreground-Background Separation," above, each character is reduced to a skeleton that serves as the basis for primitive extraction. The quality of the binary image is essential for the stroke extraction procedure and the number of true positives grows in correlation with the readability of the binary images.
§ 33 In our approach we adapted a linguistic method of graphetic script description developed by Miklas (1992 unpublished), which is based mainly on binary principles. He distinguishes two categories of character features: a) dynamic, pertaining to the realisation (production) of the letter, and b) static, describing the shape of the character, i.e., which characteristic elements, or primitives, does it consist of (perception). Sample static character features are:
|Feature 1||Number of static strokes|
|Feature 2||Number of nodes|
|Feature 3||Number of straight static strokes|
|Feature 4||Number of bent static strokes|
|Feature 5||Number of vertical static strokes|
|Feature 6||Number of horizontal static strokes|
|Feature 7||Number of loops|
|Feature 8||Number of open ends|
|Feature 9||Number of closed elements|
§ 34 Based on its skeleton each character-form is dissected into nodes and strokes. Nodes are defined as junctions in the skeleton with a minimum of three strokes crossing, while strokes are skeleton segments between nodes, from a node to an endpoint in the character (see Figure 5), or loops (Vill, Sablatnig 2008).
§ 35 The individual characteristics of each stroke (vertical, horizontal, straight, bent, etc.) constitute a sensitive classification system (Vill, Sablatnig 2008) that can even cover differences between individual hands.
§ 36 A precision and recall evaluation was executed on image data from the manuscripts, as well as on samples from a professional calligrapher (Table 5). Both datasets were also evaluated as ground truth data by our philological team. The average precision TP rate coincides closely with the ground truth data. The recall rate indicated good results with 0.20 FN for the Missale Sinaiticum and 0.17 for the calligrapher’s data set.
§ 37 These results proved that the adaptation of philological criteria for computer application is feasible for the evaluation of static graphetic features (for further details see Vill, Sablatnig 2008).
§ 38 After the automatic decomposition of static strokes, the calculated data can be transferred to a database (Gau, Vill, et al. 2010) with fundamental overview, query, statistics, and print out options (Figure 6, left side).
§ 39 The graphical user interface of the database also permits manual evaluation according to the same criteria as the automatic evaluation (Figure 6, right side). This comparison between computer and human perception provides ground truth for the quality of the computer's primitive extraction on the one hand and for psycholinguistics on the other.
§ 40 In a subsequent application we perform an analysis of the page layout (ruling, margins, decorations, etc.) based on the binary images provided by the separation step. The ruling (line) structure is automatically detected by the script flow and then extrapolated on those parts of damaged pages, where the script is not visible, by using a priori knowledge of the document layout scheme (for details see Kleber, Sablatnig, et al. 2008).
§ 41 Figure 7a illustrates a sample result of the semi-automatic layout analysis with red lines calculated from detected script and green and blue lines extrapolated according to the a priori knowledge (Figure 7b) of the layout scheme.
§ 42 This combination of IT with philological expertise made it possible to automate layout analysis even for damaged manuscripts to such an extent that it can automatically discern textual areas from margins and script from decorations or other high-contrast non-textual elements, like patches of mould.
Discussion and Conclusion
§ 43 In this paper it has been shown that the examination of corrupted manuscripts can be widely enhanced by MSI. Foreground-background separation based on spectral and spatial stroke features serves as input data for a number of consecutive investigations like character decomposition and layout analysis. The quality of the binary images is of crucial importance for these subsequent procedures. In the future, a more detailed evaluation of the foreground-background separation will include a larger variety of folia and the generation of ground truth data from several experts, in order to avoid subjectivity. Apart from the above described binarisation, in our project we also used MSI for the combination of spectral bands by principal component analysis, which increased the readability considerably. Some resulting spectral components highlight codicological details like the ruling, and the text- and decoration-inks.
§ 44 While computational script analysis methods have long been used in other disciplines like graphology, forensics, or pedagogy, they have been introduced only recently into the philological (historical) research areas of palaeography on the one hand, and graphetics and graphemics on the other. By combining philological and technical expertise we created a quantifiable method to classify (in our case: Glagolitic) characters, thereby also preparing the ground for optical character recognition of hand-written (especially damaged) documents. In the future we will expand the number of test samples and ground truth data from different experts, on the one hand, and test and expand the automatable catalogue of character features for its application to all alphabetic writing systems, on the other.
§ 45 With the evaluation of character features deriving from the automatic character decomposition we can for the first time apply modern database query routines also to traditional palaeographical questions and closely compare even very similar hands. This will further empirical answers to questions of cultural history, for example, about the provenance of early Glagolitic manuscripts.
§ 46 While the semi-automatic layout analysis still depends on a priori knowledge of the layout, it is useful for heavily degraded manuscripts, where the procedure supplies additional information beyond the mere layout scheme. Thus we can estimate, for example, the amount of missing text for the text reconstruction, detect palimpsest lines and identify missing page fragments (Kleber, Sablatnig 2010).
§ 47 As the routines for layout analysis, primitive extraction, and character decomposition have proven helpful for philological examination, they were integrated into a Toolbox for Manuscript Analysis (Gau, Vill, et al. 2010) with a graphical user interface for document and script analysis of manuscripts. For better usability the toolbox is currently being converted into a Java based environment.
Baranov, V.A. forthcoming. Machine-readable linguistic internet resources as a basis for historical philological studies. In 2 nd Int. Scientific Conf. on Applied Natural Sciences with a Special Section on Pedagogical, Philosophical, and Ethical Aspects (7-9 Oct. 2009). Trnava.
Diem, M. and R. Sablatnig. 2008. Registration of ancient manuscript images using local descriptors. In 14 th Int. Conf. on Virtual Systems and MultiMedia (VSMM) – Dedicated to Cultural Heritage, eds Ioannides, M, Addison, A, Georgopoulos, A and Kalisperis, L, 188-92. Limassol: Archaeolingua.
Easton, R.L., K.T. Knox and W.A. Christens-Barry. 2003. Multispectral imaging of the Archimedes palimpsest. In 32 nd Applied Image Pattern Recognition Workshop (AIPR), 111-18. Washington (DC): IEEE Computer Society.
Gau, M., M.C. Vill, F. Kleber, M. Diem, H. Miklas and R. Sablatnig. 2010. A Toolbox for Manuscript Analysis. In Computer Applications and Quantitative Methods in Archaeology (CAA). Making History Interactive, 86-92. Williamsburg (VA): Archaeopress.
Kleber, F. and R. Sablatnig. 2010. Scientific puzzle solving: Current techniques and applications. In Computer Applications and Quantitative Methods in Archaeology (CAA). Making History Interactive, (CD publication). Williamsburg (VA).
Kleber, F., R. Sablatnig, M. Gau and H. Miklas. 2008. Ruling estimation for degraded ancient documents based on text line extraction. In 2 nd Int. Conf. on Electronic Visualisation and the Arts (EVA). Digital Cultural Heritage – Essential for Tourism, 79-86. Vienna.
Leedham, G., C. Yan, K. Takru, J. Hadi, N. Tan and L. Mian. 2003. Comparison of some thresholding algorithms for text/background segmentation in difficult document images. In 7 th Int. Conf. on Document Analysis and Recognition (ICDAR), 859-64. Edinburgh.
Leydier, Y., F.L. Bourgeois and H. Emptoz. 2004. Serialized unsupervised classifier for adaptative color image segmentation: Application to digitized ancient manuscripts. In 17 th Int. Conf. on Pattern Recognition, 494-97. Cambridge.
Miklas, H. 2004. Analysis of traditional written sources with the aid of modern technologies. In 7 th Int. Conf. on Electronic Visualisation and the Arts (EVA). Information for All: Culture and Information Society Technologies. Moscow.
Miklas, H., M. Gau, F. Kleber, M. Diem, M. Lettner, M.C. Vill, R. Sablatnig, M. Schreiner, M. Melcher and E.-G. Hammerschmid. 2008. St. Catherine’s monastery on Mount Sinai and the Balkan-Slavic manuscript-tradition. In Slovo. Towards a Digital Library of South Slavic Manuscripts, eds Miklas, H and Miltenova, A, 13-36, 286 (Res.). Sofia: Bulgarian Academy of Science, Institute of Literature.
Vill, M.C. and R. Sablatnig. 2008. Static stroke decomposition of glagolitic characters. In 2 nd Int. Conf. on Electronic Visualisation and the Arts (EVA). Digital Cultural Heritage – Essential for Tourism, eds Sablatnig, R, Hemsley, J, Kammerer, P, Zolda, E and Stockinger, J, 95-102. Vienna.