On the Classification of the Slavic Menaia Manuscripts Dated from the 11th to 14th Centuries

In the present paper we analyze nine manuscripts of the 11th–14th-century Menaia (Greek: μηναῖον), Old Slavic hymnographic texts, using a vector space model. The analysis and classification of the manuscripts in previous studies have been rather subjective; to make the classification more objective, we apply contemporary quantitative methods. Vector analysis separates the Putyatina Menaion and the Menaion Q.п.1.25 from the rest of the analyzed texts, since the two manuscripts share both textological and lexical similarities. Existing studies reached similar findings. Manuscript BAN 16.14.13 is shown to be quite similar to the rest of the analyzed texts; this result is new to the literature.


Introduction
The Menaion is a liturgical work containing texts that glorify saints and holy days.
These texts are arranged according to the days in each month. The Menaion with its multiple manuscripts preserves a unity of existence as a text, while retaining differences on textual and language levels. Hence the importance of the classification of the collection of manuscripts. According to the Consolidated Catalogue (CC), this collection is represented by 68 manuscripts from the 11th–13th centuries (Consolidated Catalogue 1984). The largest number of manuscripts of that period, eleven to be exact, preserve the May Menaion.
The classification tradition of the hymnographic texts goes back to I. V. Yagich (1886). This tradition established the following lexical and textological parameters for the classification of the manuscripts: the composition of the manuscript (set of memorial texts), the structure of the service (the order of the hymns) and specific linguistic discrepancies. Among the contemporary researchers of the Menaion who adhere to the same approach, Maria Yovcheva should be mentioned. She describes the Service Menaion for April Sof.199 of the 12th–13th centuries from the collection of the National Library of Russia (NLR), taking into account the calendar features of the manuscript (set of memorial hymns), the structure of the service (order of the hymns), as well as the morphological, lexical and syntactic differences (Yovcheva 2014).
Textual typology and linguistic differences sort the May Menaia manuscripts into four types (Netšunajeva 2000):
1. Putyatina Menaion and its analogues (archaic type);

The composition of the manuscript and the structure of services are the typical indicators for the typological characteristics of the manuscript; the set of memorial hymns and their order are associated with a particular statute. Taking into account the set of memorial hymns, the second-type manuscripts of the 11th–14th centuries correspond mainly to the Studite statute, while the manuscripts of the third and fourth type correspond to the Jerusalem statute.

Netsunajev and Netsunajeva: On the Classification of the Slavic Menaia Manuscripts Dated from the 11th to 14th Centuries, Art. 1, page 3 of 17

In Menaia manuscripts three possible orders of hymns can be distinguished: 1) canon (Greek: κανών) – stichos (Greek: στιχηρόν, pl. -ά) – sedalen (Greek: κάθισμα); 2) sedalen – stichos – canon; 3) stichos – canon interrupted after the third hymn by sedalen and after the 6th hymn by kontakion (Greek: κοντάκιον) and ikos (Greek: οἶκος). The first two types of hymn-ordering are in accordance with the genre of the service; the third follows the order during the service. The first order of hymns is a typical feature of the archaic type Menaion. The collection of this type currently consists of four Old Russian manuscripts, including the Putyatina Menaion, five South-Slavic manuscripts and one Greek manuscript (Netšunajeva 2008).
The classification may become more detailed. For example, the May Menaion Sof.203 of the 12th century can be distinguished from the Studite manuscripts. The manuscript shows that within a large array of similar Studite type manuscripts there exists a non-standard subtype. Manuscript Sof.203 has textual features associated with a set of memorial hymns according to the Jerusalem statute, and specific discrepancies, such as the restoration of the Greek in the text: instead of (Netšunajeva 2011).
The researchers who follow the textual traditions established in the description of the manuscripts of the hymnography of the Old and New Testaments take into account the aforementioned differences in various ways (Alekseev 1999). However, these methods of classification do not fully exploit the information available in the manuscripts of the Menaion. In this article we propose to use methods of classification based on the properties of the data and, in this case, on the information concerning discrepancies contained in the manuscripts of the May Menaia. The sources for our analysis are the manuscripts of the Menaion of the first and second type.
The differences that we take as evidence for or against the similarity of texts are lexical and grammatical discrepancies. Phonetic differences are excluded at this stage of the analysis.
The article uses a new approach to the classification and analysis of the Menaion manuscripts, which we borrow from the literature on information retrieval. With the help of a mathematical model we find Menaion manuscripts of types one and two that share many common elements. To analyze the manuscripts formally we set up a vector space model in which the manuscripts are represented as vectors in a common vector space. Using this tool we can identify similar and different manuscripts, and classify manuscripts of the first and second types. The groups formed as a result of the analysis consist of Menaion manuscripts that have common lexical units and grammar.
In the considered manuscripts there are more lexical variations across fragments than there are phonetic or grammar variations. Lexical variations are important from the point of view of presence or absence of words in territorial and diachronic systems and in texts with different translation methods. Analysis of lexical variations allows one to make statements on the origin and date of manuscripts as well as topography. The grammar variations show the historical development of the language.
This paper shows that the methods of information retrieval can be successfully applied to the analysis of ancient texts. The results obtained allow us to identify similar and different Menaion manuscripts and to relate the results to the existing literature. Specifically, we can (i) detect the manuscripts having textological similarities supported by specific lexical variations, (ii) show the manuscripts that deviate from the majority of the analyzed texts, (iii) show typical and atypical usage in the language and in the structure of the text.
The remainder of the article is organized as follows. Section 2 describes and explains the choice of Menaia and of the textual fragments. In Section 3 we construct the vector space model. Section 4 presents a detailed analysis of the results obtained using the model. The last section reports our conclusions.

The Manuscripts of the Menaion
For the vector model analysis we took nine manuscripts of the May Menaion of the 11th–14th centuries: seven are described in the CC and two in the PL. A brief description of the manuscripts and their places of storage is given in Table 1. The aforementioned fragments are the only fragments present in all nine manuscripts of the analyzed period. They are thus unique in the sense that they offer a possibility for a full textual comparison of the manuscripts. It is worth noting that Q.п.1.25 is an outstanding manuscript. It consists of a single folio containing the end of the 9th song of the canon, two fragments of stich and the beginning of a sedalen. The material for analysis is therefore very limited by the length of the surviving fragments. For these reasons we believe that the analysis based on the selected fragments is representative and makes full use of the parallel information available in all nine May Menaia.

Vector Space Model
For the analysis, we use a vector space model and represent the manuscripts as vectors in a common vector space (Salton, Wong and Yang 1975; Dubin 2004; Jockers 2014). Formally, we define:

$D = \{d_1, \dots, d_n\}$: the set (corpus) of documents;
$T = \{t_1, \dots, t_m\}$: the terms contained in the documents, or a dictionary.

In our case the terms can be single words or phrases.
The next step was to determine the weight of a term in a document. Weight is the importance of the term within the specific manuscript. The vector representation of a document is a vector comprising the weights of each term in the document. If a term does not occur in the document, its weight is zero. For document $i$, for instance:

$$W(i) = (w_{1i}, w_{2i}, \dots, w_{mi})^{T},$$

where $w_{ji}$ is the weight of term $j$ in document $i$.
There are several standard ways of determining the weighting function (Leskovec et al. 2014). This paper uses the tf-idf (term frequency–inverse document frequency) weighting function. The weighting function is calculated as follows: tf (term frequency) estimates the importance of the term $t_i$ in document $d_j$,

$$\mathrm{tf}(t_i, d_j) = n_i,$$

where $n_i$ is the number of occurrences of the term $t_i$ in document $d_j$.
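As a minimal sketch of the term-frequency step, assuming each fragment has already been tokenized into a list of word forms, the dictionary and the raw counts could be built as follows. The function names `build_dictionary` and `term_frequencies` and the placeholder tokens are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer ID to every distinct term, in order of first appearance."""
    term_ids = {}
    for tokens in documents.values():
        for term in tokens:
            if term not in term_ids:
                term_ids[term] = len(term_ids)
    return term_ids


def term_frequencies(tokens, term_ids):
    """Return the raw-count vector tf(t, d) for one document; absent terms get 0."""
    counts = Counter(tokens)
    vector = [0] * len(term_ids)
    for term, term_id in term_ids.items():
        vector[term_id] = counts[term]
    return vector


# Hypothetical toy corpus: placeholder tokens, not readings from the manuscripts.
documents = {
    "ms_A": ["alpha", "beta", "beta", "gamma"],
    "ms_B": ["alpha", "gamma", "delta"],
}
term_ids = build_dictionary(documents)
tf = {name: term_frequencies(tokens, term_ids) for name, tokens in documents.items()}
```

Each manuscript then corresponds to one count vector of length $m$, with a zero wherever a dictionary term does not occur in it.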
idf (inverse document frequency) is the inverse of the frequency with which a term occurs in the corpus. It is calculated as follows:

$$\mathrm{idf}(t_i, D) = \log_{10} \frac{|D|}{|\{d \in D : t_i \in d\}|},$$

where $|D|$ is the number of documents, and $|\{d \in D : t_i \in d\}|$ is the number of documents with the term $t_i$. Therefore, the tf-idf weight can be calculated as:

$$\text{tf-idf}(t_i, d_j, D) = \mathrm{tf}(t_i, d_j) \times \mathrm{idf}(t_i, D).$$

Thus each document is now represented as a vector of weights in the space $\mathbb{R}^m$. This allows us to calculate measures of distance between the vectors using certain metrics. There are a vast number of metrics to choose from (see Cha 2007 for a comprehensive survey). In this paper we calculate the Euclidean distance between the manuscripts and the cosine similarity. These are standard similarity measures used in vector space models and clustering (Cha 2007). From our point of view, the straight line measures are more suitable for text analysis than, for instance, block metrics. The former are more intuitive for our application, as there should not be any obvious sources of non-linearity that prevent calculation of the straight line distance between two vectors of sample manuscripts.

The two measures, Euclidean distance and cosine similarity, address the same question but from a different angle. The first measure allows us to identify the manuscripts with the greatest amount of variety, while the second allows us to identify those containing similarities. The analysis of two distance measures may be seen as a complementary exercise. The Euclidean distance between manuscripts $d_i$ and $d_j$ is calculated as:

$$\lVert W(i) - W(j) \rVert_2 = \sqrt{\sum_{k=1}^{m} (w_{ki} - w_{kj})^2},$$

and the cosine similarity as:

$$S(d_i, d_j) = \frac{W(i)^{T} W(j)}{\lVert W(i) \rVert_2 \, \lVert W(j) \rVert_2}.$$

We implemented the vector space model as follows. We obtained the fragments of stich from the original manuscripts. This work was done partly in the libraries, as only manuscripts PM, Sin.166, Sof.203, Sof.204 and Q.п.I.25 are available in electronic form. We assigned each word in the fragments an integer identification number (ID). The collection of the words, and hence of the ID numbers, formed the dictionary. This allowed us to calculate the vector of weights $W(i)$, $i = 1, \dots, 9$, for each manuscript. With these objects in hand the measures of distance were easily calculated. For example, the inner product of the length-normalized vectors $W(i)$ gives the cosine similarity. All calculations were performed in Matlab.
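The weighting and the two measures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Matlab code: idf uses a base-10 logarithm, the function names and the count vectors are hypothetical, and returning 0 for an all-zero vector in `cosine_similarity` is a convention chosen here.

```python
import math


def tf_idf_vectors(tf):
    """Weight raw counts by idf = log10(|D| / number of documents containing the term)."""
    names = list(tf)
    n_docs = len(names)
    n_terms = len(tf[names[0]])
    idf = []
    for k in range(n_terms):
        doc_freq = sum(1 for name in names if tf[name][k] > 0)
        idf.append(math.log10(n_docs / doc_freq) if doc_freq else 0.0)
    return {name: [tf[name][k] * idf[k] for k in range(n_terms)] for name in names}


def euclidean_distance(u, v):
    """Straight-line distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Inner product of the two vectors after L2 length normalization."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical raw counts over a four-term dictionary (not manuscript data).
tf = {
    "ms_A": [1, 2, 1, 0],
    "ms_B": [1, 0, 1, 1],
    "ms_C": [1, 2, 0, 0],
}
weights = tf_idf_vectors(tf)
distance_ab = euclidean_distance(weights["ms_A"], weights["ms_B"])
similarity_ab = cosine_similarity(weights["ms_A"], weights["ms_B"])
```

With nine manuscripts, a term found in all nine receives idf $\log_{10}(9/9) = 0$ and so does not contribute to either measure, while a term found in a single manuscript receives $\log_{10} 9 \approx 0.954$.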
We are aware of other possible classification algorithms that may be applied for the analysis of texts, such as (hierarchical) clustering models.However, this paper focuses on the vector space model and leaves other setups for future research.
Clustering may be problematic in this setup as the models may require larger text samples than studied here.As previously discussed, we were constrained by the length of readable text in the May Menaia.

Analysis of Manuscripts
For the vector space model analysis we used two fragments of stich from the nine manuscripts of the Menaion described in Table 1. We separately analyzed the similarity of the manuscripts in lexical units and in both lexical units and grammar. For this purpose we created two dictionaries; in the first case the dictionary contained 70 terms, in the second 79 terms. In the first case we took into account lexical discrepancies. In the second case, lexical and grammar discrepancies were considered. As discussed in the previous section, we analyzed two measures of distance: the Euclidean distance and the cosine similarity. As a robustness check we also calculated a block measure, the Manhattan distance, and obtained results similar to those of the Euclidean distance. The Manhattan distance was not chosen as the primary measure because it does not represent the straight line distance between vectors.
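One way to make such a robustness check concrete is to compare, for each manuscript, which other manuscript is nearest under each metric. The sketch below assumes this nearest-neighbour comparison as the criterion, which the paper does not specify; the function names and weight vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_neighbours(vectors, metric):
    """For each manuscript, return the name of the closest other manuscript."""
    nearest = {}
    for name, vec in vectors.items():
        candidates = [
            (metric(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != name
        ]
        nearest[name] = min(candidates)[1]
    return nearest


# Hypothetical weight vectors, chosen only to illustrate the comparison.
vectors = {
    "ms_A": [0.0, 0.9, 0.1],
    "ms_B": [0.0, 0.8, 0.2],
    "ms_C": [1.0, 0.0, 0.3],
}
by_euclidean = nearest_neighbours(vectors, euclidean)
by_manhattan = nearest_neighbours(vectors, manhattan)
same_neighbours = by_euclidean == by_manhattan
```

If the two dictionaries of nearest neighbours agree, the block metric tells the same story as the straight line metric for that set of vectors; disagreement would point to manuscripts whose ranking depends on the choice of metric.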
The separation of the analysis into lexical elements and lexical and grammatical elements proved important because it allowed us to track the actual lexical variation and its relationship to certain manuscripts. Lexical and grammatical changes within a single linguistic unit showed the dynamics of some grammatical changes in diachrony and their connection with the lexical changes.
2. The manuscripts of the Studite type of the 11th–14th centuries, the main body of the CC;
3. The manuscripts of the early Jerusalem type described in the Preliminary List (PL) (1966), such as the manuscript of the May Menaion T.113 of the 14th century;
4. The manuscripts of the Jerusalem type of the 14th–17th centuries, which are in most cases recorded in the lists of libraries. For example, manuscript KB Rålamb 4: 0 n: 0130, the Festival Menaion of the 17th century from the collection of the National Library of Sweden (Stockholm).

To represent the manuscripts in vector form, we use two fragments of stich, which follows the canon of the service for May 21st in honor of holy Konstantin and Elena in PM and Q.п.I.25, but precedes the canon in the other manuscripts. As an example, we show the first fragment of text in three manuscripts: PM and Q.п.I.25, which belong to the archaic type, and manuscript BAN16.14.13, which belongs to the Studite type:

The $L_2$ norm is used to normalize the length. The larger the value of $S(d_i, d_j)$, the closer the manuscripts $d_i$ and $d_j$ are.

Consider the distance between the manuscripts when the lexical differences
are taken into account. This distance is presented in Heat map 1. It is clear that PM and Q.п.1.25 have a greater distance from the other manuscripts, while the shortest distance is observed for BAN16.14.13. The resulting measure of the cosine similarity for the manuscripts when lexical differences are considered is shown in Heat map 2.

Heat map 1: Euclidean distance between the manuscripts (lexical changes). The minimal value for each manuscript is highlighted in italics.

of the lexeme in PM indicates the presence of another original text. Discrepancy (PM) confirms this conjecture. The discrepancy comes from the Greek ἐνδοξάζειν / δοξάζειν. In the Greek text the lexeme δοξάζομεν appears and, as in the first case, is present in all manuscripts except PM.

The manuscript BAN16.14.13 is very close to the manuscripts Sof.203, Sin.166, Sof.204, BAN4.5.10, Sof.382 and T.112. BAN16.14.13 and Sin.166 are the closest. Lexeme is present in all manuscripts except Sof.382, in which is a mutation of , and T.112, in which a grammar variation of the original lexeme appears. The overlap in the manuscripts of the second type is due to the use of the short form adjective .

Therefore, it is possible to distinguish manuscripts PM and Q.п.1.25, which are separated from the others by lexical similarity. Cosine similarity indicates that these manuscripts are close. However, manuscript BAN16.14.13 has the largest number of elements similar to the rest of the group of analyzed manuscripts. Previous studies based on a different methodology, using textological and linguistic similarities, concluded that PM and Q.п.1.25 stand out as a special archaic type. The similarity of manuscript BAN16.14.13 with the rest of the studied group is a new result.

The results on the distance between the manuscripts based on the lexical and grammatical differences are shown in Heat map 3.
The red columns/rows reveal that PM and Q.п.1.25 have the greatest distance from the rest of the manuscripts. BAN16.14.13 is the closest to the other manuscripts. These results are similar to those produced by the lexical variations analysis. Heat map 4 shows the results of the comparison of the manuscripts when lexical and grammatical differences are taken into account. Furthermore, we analyzed the lexical units and grammar of individual manuscripts to reinforce the results of Heat maps 3 and 4. The following observations are worth noting.
1. The manuscripts Q.п.1.25 and PM have less similarity with the group of studied manuscripts: there is a difference in the choice

Table 1: Menaia manuscripts dated from the 11th to 14th centuries.