§1 In her entry for the Encyclopedia of Language and Linguistics, Lowe defines the aims of palaeography as follows:
The basic task of palaeography is to provide the means of dating and localizing manuscripts by establishing patterns in the development of characteristic letter forms and abbreviations. (Lowe 2006: 134)
As the entry recognises, abbreviations have been one of the most important means for dating and localising manuscripts available to palaeographers and book historians. In digital scholarship, however, they have been something of a sidetrack. Reasons range from many digital resources inheriting editorial traditions from print to the labour-intensiveness of rich XML transcription and normalisation being a prerequisite for many research questions. Moreover, to some extent, the focus of digital scholarship simply was elsewhere in the 2010s. Much work concentrated on achieving efficient results via lemmatisation, alignment, and digitisation. Furthermore, recent research in quantitative palaeography (e.g. Kwakkel 2012; Stokes 2014; Thaisen 2017) often focused on characteristic letterforms rather than abbreviations. Nevertheless, the importance of abbreviations has been acknowledged by many scholars (e.g., Hasenohr 2002; Driscoll 2009; Stutzmann 2014; Kestemont 2015). There have also been a number of quantitative studies since the 1980s that have uncovered interesting results, which this review seeks to highlight.
§2 The aim is to give an overview of scholarship into digital and quantitative approaches – taking into account English, French and German and, to a lesser extent, Dutch, Old Norse and Celtic scholarship. My hope is that the review will help anyone interested in the subject to find secondary literature and to identify trends and common interests. As many of these findings are in languages other than English, one motivation is also to help their dissemination into Anglophone scholarship. Furthermore, I hope to make a contribution to the theoretical description of abbreviations by placing them in the context of a typology of writing systems, drawing on writing systems research, which has been emerging as an important new field with contributions, such as Daniels and Bright (1996) and Cook and Ryan (2016).
2 Recognising the value of trivial or accidental variation
§3 While the importance of abbreviations is acknowledged by some scholarship, there are many other fields where there is a long tradition of treating them as a problem rather than evidence. The chief culprits are editorial practices and textual scholarship aimed at restoring texts to their original form (Driscoll 2010). Twentieth-century editorial theory often focused on authorial “work” rather than scribal “text,” treating abbreviations and other scribal variation as “accidentals” (see e.g. Greg 1951), which were not seen as relevant for the authorial work contained in the manuscript copies. Much scholarship focused on the work and on uncovering it under layers of scribal copying and errors. Even when editions are based on a single manuscript witness, they very often contain tacitly expanded abbreviations, as these were considered to be a barrier to reading and understanding the text. The messiness of “accidental” variation was thus something which the editor attempted to get rid of. It has also been a problem for the requirements of digital scholarship.
§4 Text retrieval systems are typically unable to recognise different forms of the same word and the problem is usually solved by normalisation (cf. Kestemont 2015, 160). Many research questions are unlikely to be successful “on collections of texts that do not adopt the same orthographic standard” (161). These include topic modelling, computational semantics, text mining, and stemmatology as well as some fields of corpus linguistics such as syntax. Moreover, some of the research areas and questions that evoked most interest in pre-digital scholarship have been among the first to receive digital equivalents. For example, digital stemmatology inherits the outlook of more traditional approaches in which the focus is on features that are likely to be part of the “work.” However, some have questioned this, including Andrews and Mace who call into question “the utility of excluding “trivial” variation such as orthographic and spelling changes from stemmatic analysis” (2013, 504). Moreover, digital corpora are often based on printed editions and inherit their editorial practices, sometimes conflating several editorial approaches in the same resource (Honkapohja, Kaislaniemi and Marttila 2009, 456–460). In short, for several reasons, normalisation and the expansion of abbreviations are very much the norm in digital scholarship.
§5 The main problem with normalisation is, however, that while it is necessary for some research questions, it also discards a large amount of potentially useful data, which makes other types of research impossible. As Kestemont (2015, 161) puts it: “superficial textual variations also present important scholarly opportunities, for instance for the identification of scribes or the dialectological analysis of texts.” The problems caused by the loss of data, resulting from discarding variation considered accidental or trivial, have been discussed and criticised by several scholars (Driscoll 2006; Kytö et al. 2011; Kopaczyk 2011; Rogos 2011; 2012; Stutzmann 2014; Lass 2004). This criticism is supported by the fact that there are numerous approaches which have demonstrated the value of scribal and other accidental variation.
§6 Making better use of “accidentals” or “trivial variation” is central to quantitative approaches to palaeography. It can also be linked to what can be termed “new philology” or “material philology” (cf. Driscoll 2010). Material philology has called attention to the uniqueness of each manuscript copy, including codicological features, such as writing support and level of decoration, but also all of the accidentals, such as punctuation, spelling variation, lexical variation as well as abbreviations. All of these are related: “orthography, palaeography, and codicology are overlapping realms of the archaeology of the book” (Thaisen 2011, 84). Nevertheless, there are decisions that need to be made on how much of them to take into account.
§7 The decisions are on which level to focus when studying variation. The focus of this review article is mainly on the graphemic level, as my concern is with different types of signs and their referents, not “allographic” variation between various shapes of the same sign (see e.g. Robinson and Solopova 1993, 20). In making this decision, my focus differs somewhat from much recent work, which has opted for differences on the “graphetic” level, recording allographic variation (ORIFLAMMS project, Rogos 2011; they also include studies of variant letter forms such as Thaisen 2011; Stokes 2014; Kwakkel 2012; Speed Kjeldsen 2013). The reason for this is that even though abbreviations are connected both to graphetic variation and the codes on the page, they are also part of medieval writing systems and made their orthography more complex than alphabetic writing. This complexity is something that deserves to be addressed as an important phenomenon in itself.
3 Abbreviations in the typology of writing systems
§8 What makes abbreviations unique, compared to many other features deemed accidentals by textual criticism, is the range in the way the written form corresponds to its lexical and phonological referents. Abbreviations belong to the “grey area” components of the medieval manuscript – which Traube called Halbgraphische Objekte (“half-graphic objects”) (Benskin 1977, 506; Römer 1993 and 1997; Rogos-Hebda 2018, 58). However, while the non-alphabetic nature of abbreviation is noted by several scholars (LAEME 2.2.1; see also Benskin 1977, 506; Rogos 2011, 47; 2012, 7), it is often discussed in somewhat imprecise terms or with much variation in terminology. This can be partly explained by different emphases, as some approaches are concerned with allographic variation, codicological phenomena or morpho-phonemic variation. (For example, Mazziotta aims to describe abbreviations from a graphetic perspective, seeking to model the strokes made by the scribe. Rogos [2011, 2012], Rogos-Hebda, Cottereau and Cottereau-Gabillet discuss abbreviation in addition to a range of other bibliographic codes such as initials. LAEME is concerned with phonology and morphosyntactic description.)
§9 The situation is not helped by using different names for the same phenomena in different fields of scholarship. As Römer (1997) notes, where linguists may speak of apocope, palaeographers speak of suspension (11). If we, however, focus precisely on the signs and practices used to abbreviate words in mediaeval writings and their lexical and phonological referents, it is also possible to describe them in terms of writing systems research and terminology developed by scholarship describing the emergence of writing systems in the ancient Middle East and shores of the Mediterranean.
§10 “The main issue” with classifications of writing systems “is how to reconcile the two levels of language that written symbols correspond to […] the lexicon, whether words or morphemes […], and on the other hand sounds in the phonology, whether syllables (Japanese kana), phonemes (Italian) or consonant phonemes only (Arabic)” (Cook 2016, 6). The following account uses terminology from a pioneering monograph by the Polish-born Assyriologist I.J. Gelb (1952), which aimed to describe the development of writing systems from a typologically comprehensive perspective (for a very comprehensive resource, see Daniels and Bright 1996; for a clear, if somewhat contentious, introduction, see Powell 2009).1 The study introduced a “tripartite scheme of logography, syllabary, alphabet” which has become the main typology used in writing systems research (Daniels and Bright 1996, 8). Abbreviations too can be described by this framework, as Gelb (1952, 12–13) already notes. (In Gelb’s example, the following sentence contains examples of all three: “Mr. Theodore Foxe, age 70, died to-day at the Grand Xing Station”. As Gelb notes, both the numeral “70” and the abbreviation Mr. for “Mister” are logograms, whereas “the rebus-type symbol X plus alphabetic ing” stand for the word “crossing” [ibid.].) The typology works well if one takes it as a somewhat flexible system and does not expect scripts to fall neatly under any category (Daniels and Bright 1996, 8). The placement of abbreviation symbols in this typology is illustrated by Figure 1.
§11 Writing systems can be divided into phonography, in which signs are tied to sounds, logography, in which signs are tied to words, and semasiography, which is a broad umbrella term for symbolic systems that communicate meaning without being directly tied to natural language.2 Phonographic writing systems can be further divided into syllabography and alphabetic writing. Syllabic systems refer to the basic building blocks of spoken utterance: syllables. They include Japanese kana or Korean hangul. Alphabetic systems link to vowel and consonant sounds; “a writing system in which each symbol corresponds to a particular sound of the language, and, vice versa, each sound corresponds to a symbol, is called ‘transparent’ or ‘shallow’” (Cook 2016, 7). A system like the International Phonetic Alphabet (IPA) gets very close to complete transparency, and smaller European languages like Finnish or Icelandic have fairly shallow orthographies, but the spelling systems of languages like English and French are less transparent. For example, English orthography carries such historical baggage as the “silent” final “gh” in eight, night, through and there are homophones such as dear/deer – in which, Laing argues, the distinction indicated by spelling convention is leaning towards logography (LAEME 2.2.1). In logography, signs are attached to entire words. Chinese script or Egyptian hieroglyphics are more logographic, even though both have phonographic connections (Powell 2009, 188). It is, however, also possible for written symbols to communicate meaning without having a direct phonetic or lexical referent.
§12 I call writing systems which are not connected to natural language semasiographic, following Gelb (1952) and Powell (2009). Semasiography can be defined as “material marks with a conventional reference” that “communicate information without the necessary intercession of forms of speech” (Powell 2009, 32). Semasiography is a broad and heterogeneous category which includes systems that predate phonographic and logographic writing, but also still exist beside them. Modern semasiographic signs include, for instance, emoticons, computer icons and traffic signs. Semasiographic systems can be complicated and precise, as they also include, for example, mathematical and musical notation. Yet, it would be impossible to write this article in mathematical or musical notation without giving the numbers or notes some kind of phonographic or logographic referents.
§13 Western scripts are predominantly alphabetic, but they also routinely contain characters that are syllabographic or logographic. An important point about the classification, which Gelb (1952) already notes, is that the signs can be used flexibly depending on the context (16–17). For example, the heart shape (♥ or <3), which can carry meanings such as indicating one of the four suits in a standard 52-card deck (Figure 2) or communicating love or other strong emotion as an emoticon, falls under semasiography. In examples 1 and 2, however, it is used as a logogram, but referring to different words, which most English speakers have no problem parsing as “love” and “heart.”
- I ♥ NY
- My ♥ belongs 2 U.
What is more, example 2 contains two modern abbreviations categorised as letter/number homophones by Bieswanger (2007, 5), which are used outside their normal category (see also Anis 2007). The Hindu-Arabic numeral 2 is normally a logogram, but in the example, it is used as a syllabogram due to phonological similarity /tʰuː/ with the English, single-syllable, preposition to /tʊ/. A second example of the same is the letter “U,” one of the letters of the Latin alphabet, but here serving as a syllabogram as its name, when read out aloud in English is homophonous with the single-syllable second person pronoun “you” /juː/. The fact that these are syllabograms rather than logograms is evident, as you need two signs to write a two-syllable word such as “before”: B4. It is thus very much possible to use this system to describe abbreviation: modern as well as medieval.
§14 Medieval abbreviations can be easily described using this same typology. Some abbreviations (9 “-er, -ir, -re” in aft9) are syllabograms consisting of vowel and consonant combinations (often “s” or “r” in combination with a vowel). Others (S. for “Saint” – or indeed modern acronyms) are logograms, as they correspond to the entire word. Within logograms, Laing further distinguishes “impure” logograms, which contain some phonographic cue, such as “that,” S[aint], or common religious phrases written with initials only (LAEME 2.2.1). On the other hand, Laing considers the use of Greek abbreviations in Latin such as xpc for “Christ” and iħc for “Jesus,” or the ampersand, to be pure logograms, which tend towards abstraction as they are “not subject to phonological extrapolation” (LAEME 2.2.1). Even if they are made up of letters of the Greek alphabet and were often reproduced by the scribes using more familiar Latin shapes, they do not have these values (LAEME 2.2.1).
§15 An important point to make about abbreviations, though, is that even though they may function like syllabograms or logograms, they are additional and alternative signs to an alphabetic system, not a new form of syllabic or logographic script to represent spoken utterances. Kopaczyk (2011, 96) notes that abbreviations should be treated as a sequence of letters rather than syllables, which is evident from examples such as co[ur]t or p[er]sone (Kopaczyk 2011, 96, citing an example in Roberts 2005, 11). Driscoll (2006, 14) gives another example, Þīg “Þi[n]g.” The scribe appears to have been thinking of the nasal sound [ŋ] as consisting of the digraph “ng,” as it is represented in the Latin alphabet. Rather than a new syllabic script, abbreviations are thus an elaboration of the Latin alphabet. They belong with other auxiliary devices, such as diacritics, capitalisation and punctuation.
§16 Two more useful pieces of terminology are relevant for describing abbreviations as auxiliary devices in writing systems. Powell (2009, 42) makes a distinction between phonetic complements and semantic complements. An example of a phonetic complement would be the addition of letters, such as 2nd to indicate the ordinal “second.” In the medieval system, superscript abbreviations (wt “with”) typically function as phonetic complements. Semantic complements or determinatives, on the other hand, specify the meaning of a sign with an additional mark. Marks added to Latin letters, such as a bar through the descender of a “crossed-p,” are semantic complements indicating that the letter stands for the abbreviation for “per, par, por” in Latin or English. The term semantic complement is also useful for those mediaeval abbreviations which Cappelli (1990, xxiii–xli) calls abbreviation marks significant in context rather than abbreviation marks significant in themselves. (Cappelli’s terms are in Italian Segni abbreviativi con significato proprio [xxiii] and Segni abbreviativi con significato relativo [xxix]. I am using the English translations by Heimann and Kay.) Some medieval abbreviations, such as the punctus or the horizontal bar placed above the abbreviated word, simply indicate the presence of an abbreviation. These signs are best described as semantic complements rather than logograms or syllabograms.
3.1. Problems related to expanding syllabograms and logograms
§17 Some of the problems related to the expansion of abbreviations, practised in many fields, are caused by alphabetising texts written in a writing system whose orthography also allowed non-alphabetic characters. Orthography can be defined as “the rules for using a script in a particular writing system, that is to say how the symbols spell out words etc” (Cook 2016, 6). The orthography of Medieval Latin and vernaculars permitted the use of a sometimes large number of syllabograms and logograms – the number of Latin abbreviations can be more than 50 per cent of the word count (see, e.g., Honkapohja and Liira, 2020; Stutzmann 2018); in vernacular English or French the number may be as high as 30 per cent. In Old Norse, the number could exceed even that of mediaeval Latin (Driscoll 2009, 13).
§18 The inherent difficulty with expanding abbreviations lies in the fact that logograms and syllabograms are less phonetically transparent than alphabetic script. A mediaeval text which contains logographic and syllabographic characters mixes writing systems. Such a system can, following Mazziotta (2008, §2.1.1), be called plurisystème, a “multi-system.” (Even though, strictly speaking, Mazziotta was referring to his own scribe-oriented palaeographical model of abbreviations.) If the abbreviations are expanded and represented by characters of Latin script, a modern editor assigns a definite alphabetic and implied phonetic value to them (Rogos 2012, 7). The edited text becomes a mixture of scribal and editorial language. The problem is greater with medieval vernaculars, whose spelling was notoriously varied, in contrast to the relatively homogeneous spelling of mediaeval Latin.
§19 The usual solution is to use the scribe’s most common unabbreviated form as the expansion (cf. Page 1960: xxxii; De la Cruz-Cabanillas and Diego-Rodriquez 2018). However, this may seriously distort the frequencies. For example, Smith (2018: 192, Table 9.1.) found in her Older Scots data that the scribes abbreviate the plural of nouns 61 per cent of the time (e.g., part “parts”), followed by “-is” (20%), “-es” (11%), “-ys” (5%) and “-s” (5%). This is important as the practice of spelling the plural with “-is” or “-ys” (partis or partys) instead of “-es” (partes) is considered diagnostic of Older Scots in contrast to more southern dialects. An editorial approach that expands all abbreviated plurals following the most common spelling will show “-is” as the clearly dominant variant with 81%, when in reality there is no certainty which form the scribe intended (see also Czajkowski 2018, 96).
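The arithmetic behind this distortion can be made explicit with a minimal Python sketch. It takes the percentages exactly as reported in the paragraph above (they are rounded, so they need not sum to 100) and compares two treatments: expanding every abbreviated plural to the most common spelled-out ending, and counting only the spelled-out tokens. The function names and the label "abbr" are illustrative assumptions, not part of Smith's study.

```python
# Shares of plural forms as reported above (rounded per cent of all plural tokens).
# "abbr" stands for the abbreviated plural; the other keys are spelled-out endings.
observed = {"abbr": 61, "-is": 20, "-es": 11, "-ys": 5, "-s": 5}


def expand_to_most_common(shares):
    """Reassign the abbreviated share to the most frequent spelled-out ending."""
    spelled = {k: v for k, v in shares.items() if k != "abbr"}
    target = max(spelled, key=spelled.get)
    spelled[target] += shares["abbr"]
    return spelled


def share_among_spelled(shares):
    """Relative shares when only unambiguous, spelled-out tokens are counted."""
    spelled = {k: v for k, v in shares.items() if k != "abbr"}
    total = sum(spelled.values())
    return {k: round(100 * v / total, 1) for k, v in spelled.items()}


expanded = expand_to_most_common(observed)
unambiguous = share_among_spelled(observed)

print(expanded)     # "-is" rises from 20 to 81 after editorial expansion
print(unambiguous)  # "-is" is under half of the spelled-out plurals
```

Under these assumptions the expanded text suggests that “-is” accounts for 81 per cent of plurals, whereas the unambiguous evidence supports it in fewer than half of the spelled-out cases. Neither figure tells us what the scribe intended for the abbreviated tokens; the sketch only shows how much of the apparent dominance is editorial.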
§20 De la Cruz-Cabanillas and Diego-Rodriquez (2018, 172), on the other hand, give a few Middle English examples of words whose expansion may have dialectal significance for the tradition of Middle English dialectology in which spelling variation is used to localise texts. For example, when the word man is written with a bar, the bar may be interpreted as otiose, but it may also indicate abbreviation. The form man̄ can be expanded either as “man,” “mane,” “mann” or even “maun” (if the two minims are interpreted as a “u”). (All of these forms are recorded by the Linguistic Atlas of Late Mediaeval English [eLALME, 2013 (1986)].) The expansion matters, because man is a very widely used form, whereas mane or mann are much more restricted and can be used as diagnostic forms (cf. Cruz-Cabanillas and Diego-Rodriquez 2018, 172). Consequently, there is often no certainty of how a certain abbreviation should be expanded – mixing writing systems that function on different levels can lead to problems like these.
3.2. Logography and language independence
§21 A further feature of logograms is that they are less tied to a single language than phonographic writing. For example, the Hindu-Arabic numeral 2 can be read aloud in any language. An English person would expand it as “two,” a French-speaker as “deux” and a Finnish speaker as “kaksi.” Abbreviations, including less straightforwardly logographic ones, can sometimes also be expanded in several different languages. As Voigts (1989, 91) mentions, “[n]o educated reader is perplexed by e.g., i.e., or even viz., but it is by no means certain that even the latinate individual actually thinks exempli gratia, id est, or videlicet when he reads or writes those letters.” There is some contemporary evidence that scribes could associate Latin abbreviations with the vernacular. The following examples are from three closely related copies of a plague treatise in medical manuscripts from the 1450s and 1460s (Honkapohja 2017, 136). The scribe of Sloane 3566 writes English “that is to say” where the other scribes use the Latin abbreviation s. for “scilicet.”
- The hede veyne lyeth aboue þe body veyne s. cardiac. (Trinity, f. 80v)
- The hede veyne lieth aboue the bodye veyne s. Cardiak. (Boston, f. 48v)
- The hede vayne lyeth a bove þe body vayne þat is to sey þe cardiake. (Sloane 3566, f. 97v)
[The head vein lies above the body vein, namely, cardiac]
§22 A number of studies have uncovered evidence that this language-independent quality was sometimes utilised in historical texts. Hector (1958, 37) mentions that English proper names in Latin documents could be “terminated by a mark of suspension to preserve the fiction that they were declinable Latin words.” It has even been argued that the language-independent quality of abbreviations may have been used on purpose.
§23 Abbreviated words that can be expanded in several languages are called visual diamorphs by Ter Horst and Stam (2018, 234),3 who focus on Latin and Gaelic. According to them, “[s]ome words are abbreviated ambiguously, and they can consequently be resolved as both Latin and Irish. One example is the title aps., which can stand for “apostle” in Latin, apostolus, or Irish: apstal” (Ter Horst and Stam 2018, 234). Wright has studied abbreviations in English/Anglo-Norman/French mixed-language documents (see e.g. Wright 2002, 2011 and 2018). The author stresses the importance of abbreviations in the complicated contact situation during the late Middle English period, suggesting that the abbreviation system may be part of the reason, as it can be used to suppress the language-specific grammatical endings and highlight the stem. For example, a word such as argent9 can be read as Latin (e.g. argentem, argentis), Anglo-Norman French (argenté), or Middle English argent “(a) Silver, silver coin; (b) her. Silver-coloured, silver-gilt” (MED). Czajkowski (2018, 90) notes that abbreviated forms of pronouns can be used to suppress differences between High German and Low German forms of personal pronouns, as they can be expanded both to Low German “he,” “we” and “unse” or High German “er,” “wir” and “unsere.” Consequently, abbreviations have an important function in language-independent communication due to being logographic, and there is evidence that this quality was sometimes exploited in pre-modern multilingual texts.
§24 The importance of abbreviations for multilingual practices is ignored by some digital approaches. King, Kübler and Hooper (2015) use automatic language identification on the Chymistry of Isaac Newton, an online resource which painstakingly encodes his alchemical symbols (Walsh and Hooper 2012). The approach used by King et al. is based on character n-grams. One of their aims is to identify language switch-points. Yet, they discard, among other things, the “large number of conventional alchemical symbols” used by Newton as “non-words” (536). While this is understandable from the point of view of the algorithm, as it operates on English written in the Latin alphabet, it does not take into account research into code-switching in which signs tending towards logography and semasiography can function as visual diamorphs and switch-points. Approaches to historical code-switching could be augmented by taking abbreviations into account.
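What is lost when symbol tokens are discarded can be illustrated with a minimal Python sketch. It is not the implementation used by King et al.: the tokenisation, the trigram function and the rule for what counts as an alphabetic word are all assumptions made for illustration, and the sample sentence is invented rather than quoted from Newton. Instead of dropping non-alphabetic tokens, the sketch sets them aside together with their positions, so that they remain available as candidate switch-points.

```python
import re

# Assumption: an "alphabetic word" is a token made only of unaccented Latin letters.
LATIN_WORD = re.compile(r"[A-Za-z]+")


def char_ngrams(token, n=3):
    """Return the character n-grams of a token, padded with '_' at both edges."""
    padded = f"_{token.lower()}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def split_tokens(tokens):
    """Separate alphabetic tokens from symbol tokens, keeping their positions."""
    words, symbols = [], []
    for position, token in enumerate(tokens):
        if LATIN_WORD.fullmatch(token):
            words.append((position, token))
        else:
            symbols.append((position, token))
    return words, symbols


# Invented sample: an English clause, a symbol for gold (sun), two Latin words.
tokens = "the ☉ is dissolved in aqua fortis".split()
words, symbols = split_tokens(tokens)

# Only alphabetic tokens feed the n-gram features of a language identifier.
features = {token: char_ngrams(token) for _, token in words}

print(symbols)          # symbol tokens survive with their position in the line
print(features["the"])  # trigrams for one alphabetic token
```

A pipeline that keeps the `symbols` list can later ask whether a symbol sits at or next to a change of language, which is the kind of question that research on visual diamorphs raises.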
4. Conditioned and unconditioned abbreviation
4.1. Conditioned variation
§25 An important point about abbreviation is that the abbreviated form is an alternative to the expanded form, which is used when a shorter form is needed under certain conditions. Many sources note this, albeit using different terminology and having slightly different theoretical emphases. Mazziotta (2008, §18) discusses how the two forms are independent signs linked by a “synonymic relationship” which can be linked to “extralinguistic constraints.” Wittberger-Markwardt (2018, 68) and Nievergelt (2017, 231) call this relationship “lexical equivalence.” They further use the terms Substituendum, which refers to omitted full graphemes, and Substitutum, for ones that replace them (Wittberger-Markwardt 2018, 72). Samuels (1983), Thaisen and Da Rold (2009) and Thaisen (2011) use the terms “short form” and “long form,” including in short forms also shorter spellings (for example, do would be a short form, but doe “do” or doo would be classified as long forms). They too note that shorter forms are used as an alternative for the longer ones. What all these authors agree upon is that the use of the shorter variant is subject to certain conditions.
§26 Variation that is conditioned by the surrounding environment could be seen as analogous to what in phonology is referred to as conditioned variation. I owe this observation to discussions with my former office mate Dr Raffaela Baechler (personal communication).
§27 Conditioned variation (2007) refers, for example, to sound changes, such as assimilation due to ease of articulation, in which a sound is affected by the phonological environment of the surrounding sounds. To use an example from the Concise Oxford Dictionary of Linguistics (2020), “the [ɱ] of comfort can be described as a conditioned variant, in the position before a labiodental fricative, of the unit realized by another variant, [m], e.g. in bumper.” Thus, phonological variation is conditioned by the phonological environment of the surrounding sounds. Writing an abbreviated form instead of the full form is conditioned by the need to save time or space.
§28 Abbreviation is undertaken to save space or time, which Petti (1977, 22) calls economy of space and economy of time. Using an abbreviated form instead of the full one trades off some intelligibility (Avi-Yonah 1940, 9; Bozzolo et al. 1990; Cottereau 2005, 623), but it has the advantage of fitting the needed message into a smaller space, which could be important for reasons of conserving parchment or a carving surface of metal, stone or wood. Economy of time, on the other hand, was relevant throughout Antiquity and the Middle Ages, since “[t]he commonest way of committing words to writing was by dictating to a scribe” (Clanchy 1979, 97).4 In addition to these two, some authors, such as MacLean (2002), whose focus is on Greek epigraphy, mention saving labour. As he puts it, “abbreviations were used as a means of reducing labor and saving space on the stone’s surface” (2002, 49). Whatever the reason, the question then becomes whether quantitative palaeography can be used to study whether abbreviation was conditioned by something we can measure.
§29 Saving time may be somewhat difficult to establish hundreds of years later. It is impossible to measure how much time the scribe used to write a particular character, and there are many variables such as the type of script and its grade (Cottereau 2005, 623). Some scholars attempt this. Thaisen (2011) studies the use of shorter forms, including abbreviation, in copies of the Canterbury Tales. He argues that their greater frequency in stretches of text which the scribe seems to have “completed late therefore tallies with the impression of the codicologists and palaeographers of a scribe finishing his work at some speed” (Thaisen 2011, 80). Cottereau (2005, 625) argues that conventionalised abbreviations, such as the nomina sacra, as well as abbreviations of small function words could be associated with economy of time (Cottereau 2005, 628–629). There is thus arguably some evidence for economy of time in historical data. Nonetheless, it is difficult to measure and may be something that has to be argued for in the traditional qualitative way.
§30 Economy of space, on the other hand, can be quantified and several studies have by now identified ways of measuring variation due to physical constraints. One of the easiest to quantify is simply the size of the manuscript in relation to abbreviation frequency. Not surprisingly, a number of studies that compare abbreviation frequencies to manuscript size (e.g. Cottereau 2005; Cottereau-Gabillet 2016; Thaisen 2011, 79; Honkapohja 2018, 250) have found abbreviation is more common in smaller manuscripts.
§31 Codicological units smaller than the manuscript may also be constraints that contribute to abbreviation frequency. The need to conserve space may be particularly pressing towards the end of a quire, a page or a line. Camps (2016) tested for all of these variables but only found a fairly weak correlation for page and quire end. He found a more significant one for the end of the line. Shute (2017), who studied spelling variation in early printed books by William Caxton, using cluster analysis, noted that abbreviations are more likely to occur close to the right margin, and the results are statistically significant. The same phenomenon was also noted by Bozzolo et al. (1990, 23) as well as Cottereau (2005, 625–626), who also noted that the final word of the line is always more likely to be abbreviated than the first word of the line (627). She also found that the type of abbreviation may vary in different positions, as more specific abbreviations were more common towards the ends of lines as opposed to ones realised by simple nasals. There is thus ample quantitative evidence for the importance of the end of the line as a conditioning variable for manuscript abbreviations.
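How such a line-position effect might be counted can be shown with a minimal Python sketch. It does not reproduce the method of any of the studies cited above. The data structure, the three position labels and the sample lines are assumptions made for illustration, and the abbreviation flags in the sample are invented, not taken from a manuscript.

```python
from collections import Counter

# Hypothetical transcription: each line is a list of (word, is_abbreviated) pairs.
# The abbreviation flags are invented for the sake of the example.
lines = [
    [("the", False), ("hede", False), ("veyne", False), ("lyeth", True)],
    [("aboue", False), ("þe", True), ("body", False), ("veyne", True)],
    [("þat", True), ("is", False), ("to", False), ("sey", False)],
]


def rates_by_position(lines):
    """Share of abbreviated words in line-initial, medial and line-final position."""
    totals, abbreviated = Counter(), Counter()
    for line in lines:
        last = len(line) - 1
        for i, (_, is_abbr) in enumerate(line):
            # A one-word line is counted as "first" under this simple rule.
            slot = "first" if i == 0 else "last" if i == last else "middle"
            totals[slot] += 1
            abbreviated[slot] += is_abbr
    return {slot: abbreviated[slot] / totals[slot] for slot in totals}


rates = rates_by_position(lines)
print(rates)
```

With real data, the three rates would be the starting point for a significance test rather than a result in themselves; the sketch only shows that the comparison requires a transcription which preserves both line breaks and unexpanded abbreviations.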
§32 Going further down in specificity, we get to the context of individual words and letters. Factors such as the length of the word and the number of syllables may have an effect on the use of abbreviation. Cottereau (2005, 653) found that final syllables were more likely to be abbreviated by more complex abbreviation signs, whereas the simple horizontal bar was more common in earlier syllables. One further discovery related to conditioning variables is the context of the preceding letter, made by Smith (2018, 205–206, 208, Figure 9.7), who found that the preceding letter form is the best predictor for the word-final plural abbreviation: it occurs after k, r, g, t and c and not after h, n, m, l, s and p. Letter forms which terminate in a horizontal stroke are likely to be followed by the abbreviation, whereas ones terminated by lobes or vertical strokes are less likely to be followed by the abbreviated plural.
§33 Perhaps the most thorough consideration of economy of space comes from the pioneering French-Italian group Quanticod, who published a number of studies in quantitative diplomatics (see e.g. Bozzolo et al. 1990, 22). They approached abbreviation on the one hand with a detailed break-down based on what we here call conditioning variables, such as the position of the word in the line or the position of the syllable in the word, and on the other with a functional framework which differentiated between two types of economy of time (reading and writing) and economy of space. One of the notions they proposed was agréabilité “agreeability” of certain syllables, which means that some types of syllables were more suitable for abbreviation than others. The work of Quanticod influenced later French scholarship, including Hasenohr (2002) as well as Cottereau-Gabillet (Cottereau 2005; Cottereau-Gabillet 2016) and Camps (2016), who have been able to demonstrate its suitability.
§34 There is a chance that agreeability was governed by something resembling prescriptive conventions. Quanticod was influenced by one fifteenth-century copy of De laude scriptorum by Jean Gerson, which contains an appendix called Quedam regule de modo titulandi. The treatise contains what appear to be one fifteenth-century scribe’s “guidelines” or “stylesheet,” giving instructions, such as which abbreviations should be used in high grade books such as the Bible (Caen, Bibl. Mun., Coll. Mancel, ms. 131). It lists which abbreviations should be used for which syllables and where in the word they should be used.
§35 Further examples of abbreviation guidelines survive in early printed books. For example, the German Schryfftspiegel (1527) also lists abbreviations by syllables. It instructs that abbreviation should never be used for majuscule and for minuscule only “in need” (“in der not”) at the end of the line (Römer 1997, 12–13). (Römer also lists a number of other German early printed books which give instructions on the use of abbreviations.)
§36 To sum up, abbreviated forms are alternative spellings for fuller forms that were used under certain conditions: saving space, time and effort. Time and effort are difficult to study with quantitative precision, but saving space is one of the easier things to quantify. It is something where digital approaches have much to add to the argument. Abbreviation can then be analysed quantitatively by tagging the relevant contexts and using them as variables in a statistical analysis: the position in the line, the preceding and following letters, the position in the word, the manuscript page and the quire. If such causes can be pointed to as the reason why a scribe used an abbreviation, we are dealing with conditioned variation. If no clear, immediate reason for using an abbreviation rather than a full form can be found, we must ask the question of whether there are any other reasons which might have led to its use.
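The kind of analysis summarised here can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions: the token records are invented, the three position labels are a hypothetical tagging scheme, and a simple 2×2 Pearson chi-square statistic stands in for the more elaborate regression and clustering models used in the studies cited above.

```python
from collections import Counter

# Hypothetical token records: (position_in_line, is_abbreviated).
# The values are invented for illustration and are not data from any cited study.
tokens = [
    ("initial", False), ("medial", False), ("medial", True), ("final", True),
    ("initial", False), ("medial", False), ("medial", False), ("final", True),
    ("initial", True), ("medial", False), ("medial", False), ("final", False),
]


def abbreviation_rates(records):
    """Return the share of abbreviated tokens for each line position."""
    totals, abbreviated = Counter(), Counter()
    for position, is_abbreviated in records:
        totals[position] += 1
        if is_abbreviated:
            abbreviated[position] += 1
    return {pos: abbreviated[pos] / totals[pos] for pos in totals}


def chi_square_2x2(records, position="final"):
    """Pearson chi-square statistic for `position` versus all other positions."""
    a = sum(1 for p, abbr in records if p == position and abbr)
    b = sum(1 for p, abbr in records if p == position and not abbr)
    c = sum(1 for p, abbr in records if p != position and abbr)
    d = sum(1 for p, abbr in records if p != position and not abbr)
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


rates = abbreviation_rates(tokens)
statistic = chi_square_2x2(tokens)
print(rates, statistic)
```

With real data, further tagged variables (the preceding letter, the page, the quire) would be added as extra fields of each record, and the statistic would be replaced by whichever model suits the research question.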
4.2. Unconditioned variation
§37 Economy of time and space do not, however, account for all mediaeval abbreviation. For example, Camps (2016, ccliv) notes that a regression model which takes the ends of quires, pages or lines into account uncovers a number of weak correlations but leaves much variation unexplained. To continue with the analogy to phonology, such variation could be called unconditioned.
§38 Unconditioned or spontaneous variation in phonology refers to variation that cannot be attributed to the immediate phonetic context of the word. “Phonological variation is often studied from a sociolinguistic point of view, i.e. by examining the use of variants […] such as sex, age, style, register and social class” (Anttila 2002, 206). These interact with the conditioned internal factors “such as phonology, morphology, syntax and the lexicon,” which is “what makes it interesting to the phonologist.” Similarly, once we have sorted the internally conditioned variation caused by saving space, there could be much that abbreviations can tell us.
§39 Unconditioned variation for manuscript abbreviations could encompass variables related to geographic or diachronic variation or the socio-historical production circumstances of the manuscripts. Scribes from a certain area may have acquired their abbreviation practices from some local writing centre, such as a monastic scriptorium or an administrative office responsible for certain kinds of documents. It could also include diachronic change. Abbreviation practices change over time to adapt to different circumstances of text production, different writing materials, possibly even changes in reading practices, when literacy moved outside monasteries and spread to wider sections of the reading public.
§40 There have been a number of quantitative surveys of abbreviations which have produced interesting results. Two of the earliest studies are by Bozzolo et al. (1990) and Hälvä-Nyberg (1988). Hälvä-Nyberg (1988) studied Latin inscriptions from Rome and Africa; extending from the earliest surviving sources to the eighth century, the corpus is selected to cover the switch from Roman pagan traditions to Christian ones. Bozzolo et al. (1990) studied fifteenth-century liturgical manuscripts. This was followed by Hasenohr (2002), who focused on French and Latin from the seventh to eleventh centuries. Römer (1997) contains a thorough quantitative description of a range of German manuscripts from southern Germany, Switzerland and Austria, starting from 1300. A recent project at the University of Zurich (see e.g. Wittberger-Markwardt 2018, Nievergelt 2017) focused on Old High German glosses. These should all provide some coverage of abbreviation practices in various areas and in various times, although they are not fully comprehensive.
§41 The question of how various socio-historical variables could influence the types of books copied was studied by Cottereau-Gabillet (Cottereau 2005, her PhD; Cottereau-Gabillet 2016, which contains some of the results in English). She studied hundreds of manuscripts from Paris libraries, which were selected with the criterion that the name of the copyist is known. Using sociohistorical variables in a statistical enquiry, she found that abbreviation is less frequent in higher status copies. Abbreviation density can be predicted by the type of patron or the “grade” of the manuscript, as more luxurious copies would have fewer abbreviations.
§42 While French and German studies have been much more comprehensive with respect to abbreviation, the English and Scots historical dialectological tradition can make use of corpora based on manuscripts localised based on their language. These can be used to study abbreviations with respect to geographical variation. Such studies include Smith (2018), who found that the abbreviation “-is” is more common in more densely populated boroughs of Scotland: the scribes who copied more were more likely to use it. Honkapohja (2019a and 2019b), on the other hand, discovered that there are a number of abbreviations specific to West Midlands counties in the Early Middle English period.
§43 There may also be differences within writing traditions that could be examined. Hasenohr (2002, 82–83) proposes that there is a major difference between monastic and scholastic abbreviation practices. Monastic writing was a slow and contemplative process. Scholastic abbreviation would include using more logographic abbreviation. Camps (2016, cclv) notes a number of interesting possibilities for this, proposing that abbreviations would peak with thirteenth-century scholasticism.
§44 It has been argued that abbreviations can also give instructions on how to pronounce words. For example, Hasenohr (2002, 92) discusses superscript abbreviations in French in connection to pronunciation and whether the words were monosyllabic or polysyllabic in Latin and French. N.R. Ker cites a manuscript in which the Latin word neque “neither,” which is expanded, has been consistently corrected to neq. He argues that the reason was stress in reading aloud, as writing the word in full would mislead one into stressing the second syllable (cf. Clanchy 1979, 217; Ker 1960, 51).
5 Scribal profiles
§45 One of the most important uses of abbreviations in palaeography is for dating and localising scribal hands; for example, Ludwig Traube (1902) noted that when looking into the date of a manuscript, he immediately turned to the abbreviations. To give a practical example, Ker (1960, 54) identifies one scribe as Norman, because he expands an Old English abbreviation as “hoc” instead of “autem.” This leads to the question of whether abbreviations could be used digitally for the identification of scribal stints using methods developed for stylometry, compiling a scribal profile.
§46 There are pre-digital attempts to theorise and systematise the study of scribal accidentals. In the English tradition, this was proposed by McIntosh (1975), who hypothesises that every Middle English scribe is likely to have a unique linguistic profile (LP) of spellings and grammatical forms and also a graphetic profile (GP), of “linguistically sub-systemic […] phenomena.” In French philology, Cazal, Parussa, Pignatelli and Trachsler (2003) have developed an equivalent system. A unique profile of graphetic elements is thus something that one would expect to be very well suited for computational analysis. There have been a number of interesting and encouraging results using digital approaches to scribal profiles, for example, by Thaisen (2020), who was able to use probabilistic modelling to identify scribal stints, and De Bruijn and Kestemont (2013), who applied n-grams and contrastive multivariate analysis to a Middle Low German chivalric romance.
§47 Out of the quantitative approaches to scribal profiles, there are few which make use of data which encodes abbreviations. Speed Kjeldsen (2013) notes differing abbreviation use by different scribes (391, 404). Kestemont (2015) studied scribal profiles, including abbreviations, using Principal Component Analysis for a richly encoded XML transcription of a letter collection of the Middle Dutch mystical female poet Hadewijch. He identified a change of scribal hand, based especially on the use of abbreviations, as “tilde abbreviations seem more common in the first part of the copy in ms. A,” while “B adopts a more abbreviation-rich orthography than A for this part of the letters” (Kestemont 2015, 172–173). The result was thus that even though abbreviations were not the predominant focus of the study, they emerged as the distinguishing factor. This result, along with the discoveries based on graphetic variation, suggests that there is considerable potential in the use of abbreviation as a means of digital scribal profiling.
§48 A potential area of enquiry might be scribal profiles and high-frequency abbreviations. Kestemont (2015) noticed that one scribe used the all-purpose tilde-abbreviation much more than the other. Honkapohja (2019a: Figures 5–7) found that abbreviation in LAEME consists predominantly of certain types, especially the macron and hook. Camps (2016, cclvi) proposed that the best approach for studying similarities and differences between manuscript witnesses of the same text would be to focus on certain abbreviation types. A quantitative profiling of these high-frequency types is something which definitely deserves more investigation.
§49 An alternative would be to focus on function words and content words. Abbreviations seem to have been particularly frequent for function words. For example, Hasenohr (2002, 80) notes that most of these abbreviations were created in the twelfth and thirteenth centuries for use in cursive writing; they mainly affect the endings, the adverbs, particles and pronouns, as well as the forms of the verb esse, words “which come often under the pen” (qui reviennent souvent sous la plume). Rogos-Hebda (2018, 55) notes differences in the use of abbreviations by two Chaucer scribes. Honkapohja (2018, 267–8) notes that scribes are more likely to copy lexical words directly, but have individual profiles for function words. A focus on function words has parallels with developments in stylometry in which function words were considered to be the best way of establishing authorship profiles, before n-grams proved to be more efficient (Kestemont 2014, 62). Nevertheless, with abbreviations the use of function words has not been compared to n-grams, which are popular partly because they can be used with plain text (De Bruijn and Kestemont 2013, 182). Perhaps a rich transcription in which abbreviations are encoded could complement the study of scribal stints by n-grams. Or perhaps a transcription in which abbreviations are encoded could itself be subjected to n-gram based analysis.
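The last suggestion can be illustrated with a minimal sketch. It assumes a hypothetical transcription convention in which every abbreviation mark is written as {name} and treated as a single symbol, so that n-grams can span letters and abbreviation marks alike; the two sample stints are invented, and cosine similarity is only one of several possible ways of comparing profiles.

```python
import math
import re
from collections import Counter

# Hypothetical convention: an abbreviation mark is written as {name} and
# counts as ONE symbol, so "d{macron}s" consists of three symbols.
SYMBOL = re.compile(r"\{[^}]+\}|.", re.DOTALL)


def ngram_profile(text, n=2):
    """Relative frequencies of symbol n-grams in an encoded transcription."""
    symbols = SYMBOL.findall(text)
    grams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two n-gram profiles (1.0 means identical)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Invented sample stints; not taken from any manuscript discussed here.
stint_a = "d{macron}s reg{macron} et {tironian_et} om{macron}s"
stint_b = "dominus regum et et omnes"

similarity = cosine(ngram_profile(stint_a), ngram_profile(stint_b))
print(round(similarity, 3))
```

Because the abbreviation marks survive as symbols, a scribe's preference for a particular mark in a particular letter context shows up directly in the profile, which a plain-text n-gram analysis of the expanded forms would not capture.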
6 Encoding abbreviations
§50 In order to study abbreviations quantitatively, we need to count them somehow. While earlier studies collected their dataset manually, these days the standard is TEI P5 XML. TEI P5 is a flexible framework divided into modules, which provide a number of alternative ways to encode abbreviations. As we have now had a few decades’ worth of projects using TEI XML, the problems, solutions and theoretical implications of dealing with different types of abbreviations have been discussed in many sources for different languages and document types (cf. Heiden et al. 2002; Driscoll 2006, 2009; Mazziotta 2008; Stutzmann 2010, 2014, 284; Speed Kjeldsen 2013, 34–38; Honkapohja 2013a; Horst and Stam 2018). Some projects decide to encode abbreviations in their resources and, as a by-product, publish an article or guidelines detailing their editorial choices. Moreover, individual projects and even individuals working within the same project may encode the same abbreviation differently. The aim of this paper is not to argue in favour of any particular encoding choices, but to function as a review article that will help readers to navigate these waters and also to outline how the theoretical contribution made in the previous section fits with two prominent encoding systems. However, one particular problem deserves highlighting, as it is connected with the writing system cline discussed in section 3 and offers greater theoretical clarity on it.
§51 The theoretical distinction outlined in section 3 fits in well with the discussion by Driscoll (2006, 2009), which is influential not only in Scandinavian but also Anglophone scholarship. Some abbreviations replace a string of characters, while others are more logographic. The type of the abbreviation and whether it has a clear referent in a string of Latin alphabetical characters will affect the “editorial policy” on how it is best to be encoded. It is necessary to distinguish between abbreviations that refer to the entire word and ones that correspond to a sequence of characters. Driscoll (2006, 259) calls these abbreviations “with a graphemic reference (superscript letters and signs and the remainder of the brevigraphs)” and those “with a lexical reference (suspensions, contractions, and a number of brevigraphs).” He goes on to say that “[i]t strikes one as counterintuitive to treat the former on anything other than the whole-word level, while treating the latter in the same way seems equally misconceived.” This distinction corresponds exactly with (attempted) phonographic and logographic writing systems (see Examples 1 and 2).
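To make the two levels of encoding concrete, the following sketch parses a small TEI fragment in Python. It is not a reproduction of the article's Examples 1 and 2: the fragment and its word forms are invented for illustration, and it assumes one common TEI P5 pattern, in which a whole-word abbreviation is wrapped in choice with abbr and expan, while a character-level abbreviation uses am for the mark and ex for the supplied letters.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: one whole-word abbreviation, one character-level
# abbreviation (combining macron in <am>), and one unabbreviated word.
SAMPLE = """<ab xmlns="http://www.tei-c.org/ns/1.0">
  <w><choice><abbr>dns</abbr><expan>dominus</expan></choice></w>
  <w>reg<choice><am>&#x0304;</am><ex>um</ex></choice></w>
  <w>et</w>
</ab>"""


def describe(word):
    """Return (level, expanded form) for a <w> element, or None if unabbreviated."""
    choice = word.find(f"{TEI}choice")
    if choice is None:
        return None
    expan = choice.find(f"{TEI}expan")
    if expan is not None:
        # The expansion refers to the word as a whole.
        return ("whole-word", expan.text)
    ex = choice.find(f"{TEI}ex")
    if ex is not None:
        # Text before <choice> is the written stem; <ex> holds the supplied letters.
        # Any text following <choice> is ignored in this simplified sketch.
        return ("character-level", (word.text or "") + (ex.text or ""))
    return None


root = ET.fromstring(SAMPLE)
results = [describe(word) for word in root.iter(f"{TEI}w")]
print(results)
```

Counting the two kinds of result separately gives a first, rough measure of how far a transcription treats abbreviations as referring to words or to sequences of characters.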
§52 A different solution is proposed by Mazziotta (2008) and discussed by Stutzmann (2010 and 2013). This approach advocates a different way of seeing abbreviations from the point of view of the writing system, focusing on the work of the scribe. The system seeks to model the scribe’s strokes in a “descriptive” way based on their position in the “graphic space” without reference to the way they are vocalised in a phonological word (Mazziotta 2008, §13, §18). Mazziotta divides them into a few axes (§19) and proposes terminology for describing various relationships (cf. Mazziotta 2008, §18–§49). For example, the very common “crossed-p” abbreviation is not seen as a single sign, but rather as the letter p modified by a cénégramme (Stutzmann 2008, 265–66). The solution he proposes is elegant, but somewhat complex. It results in quite a bit of terminology, something for which writing systems research is notorious (Powell 2009, 263): Mazziotta uses the terms périgramme, linégramme, cénégramme, caténogramme, logogram and plérégramme. As Stutzmann (ibid.) notes, it also goes against the encoding philosophy of MUFI and Unicode. It does have advantages for encoding abbreviations, as it removes the common transcription problem of trying to fit into exact encoding boxes something which was a single stroke for the scribe. In ambiguous cases, the transcriber no longer has to decide whether a horizontal bar which crosses through, e.g., a tall ascender for l but also simply sits on top of other letters is a “macron” or a “crossed-l.” Encoding based on these theoretical conceptions is discussed by Mazziotta (2008, 3.2.1–3.2.2) and Stutzmann (2010 and 2013). In the form applied by the ORIFLAMMS (2020) project, the encoding would be like the one presented in Example 3.
§53 The views presented in this article are not incompatible with the approach advocated by Mazziotta and Stutzmann. The various types of combining marks fall under what Powell (2008, 45–46) calls “auxiliary signs and devices,” placing them in the same group with features such as capitalisation, diacritic marks or punctuation, which are established conventions developed on top of the writing system. On the other hand, the discussion is on a different level. This paper deals with the graphemic level and focuses on the reader. Mazziotta’s system models abbreviations from the point of view of scribal practice, and he makes it clear that his concern is not with the psychology of reading (2008, §46). ORIFLAMMS (2020) treats them in addition to various graphetic features. As Mazziotta admits, his approach accounts for abbreviations in terms of strokes made by the scribe, described in relation to other strokes. It is, however, not fully compatible with the psychology of reading, as abbreviated forms become a kind of logogram: experienced readers are likely to read on the word level rather than on the level of individual graphs. Hasenohr suggested such a transformation from the contemplative reading of monasteries to the logographic reading in universities. While there are advantages to graphetic transcription, such as being able to carry out statistical investigations without presupposing a writing system, the question of referent for abbreviations is more complex than for some other features and is of linguistic interest also from the point of view of where these features fit in the typology of writing systems from a graphemic point of view.
§54 One further problem related to encoding abbreviations derives from the language-independence of especially some logograms (see section 3.2 above). The mark-up for such visual diamorphs in XML is addressed by Ter Horst and Stam (2018), who use the @lang attribute to specify which language a certain abbreviated word belongs to and add a value for words that are part of both languages: “The preferred method for signalling code-switches in XML is the language-attribute (@lang=""). Apart from the standard value for Latin ("la") and Irish ("ga," for Gaelic) we added the custom value ‘ga-la’ for visual diamorphs” (Ter Horst and Stam 2018, 224). TEI Guidelines are thus well-suited for handling not only the logography/phonography variations, but also visual diamorphs.
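A short sketch shows how such language values could be tallied once they are encoded. It assumes that the language attribute quoted above corresponds to the xml:lang attribute of TEI P5 and that a word without its own value inherits the language of its parent element; the fragment and its word forms are invented placeholders rather than data from Ter Horst and Stam.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: an Irish passage ("ga") containing one Latin word ("la")
# and one abbreviated word tagged with the custom value "ga-la".
SAMPLE = """<ab xmlns="http://www.tei-c.org/ns/1.0" xml:lang="ga">
  <w>ocus</w>
  <w xml:lang="la">dominus</w>
  <w xml:lang="ga-la"><abbr>.i.</abbr></w>
</ab>"""


def count_languages(element, inherited=None, counts=None):
    """Count <w> elements per language, inheriting xml:lang from ancestors."""
    counts = Counter() if counts is None else counts
    lang = element.get(XML_LANG, inherited)
    if element.tag == f"{TEI}w":
        counts[lang] += 1
    for child in element:
        count_languages(child, lang, counts)
    return counts


counts = count_languages(ET.fromstring(SAMPLE))
print(dict(counts))
```

Restricting the count to words that contain an abbr element would give the share of abbreviations that are language-neutral in a given text.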
7 Suitable resources for the study of abbreviations
§55 Studying abbreviations quantitatively requires digital transcriptions that encode the forms of abbreviation as well as their expansions. While many earlier corpora were based on printed books, the situation is improving. The numbers are not nearly as high as those of corpora that do not encode abbreviations or of manuscript images available online (Robinson 2016, 182–7), but there are now several resources available that offer diplomatic transcriptions that indicate abbreviations in a format suitable for quantitative analysis.
§56 Editions of individual works include Murchinson (2017), Camps (2016), Dunning (2016), Honkapohja (2013b) and De Leeuw van Weenen (2009). All of these make use of the TEI architecture for abbreviations, and allow toggling between diplomatic view and normalised view, which means the abbreviations are available to the user. Larger corpora that encode abbreviations in a way that enables applying quantitative methods include texts made available by the Medieval Nordic Text Archive (MENOTA), a network of libraries, archives and research departments working on Old Icelandic, Old Norwegian and Old Swedish texts. The project makes use of TEI P5 XML. Another one is the Base de Français Médiéval (BFM 2020), which contains French texts from the ninth to the fifteenth centuries tagged in TEI P5 XML. The Chymistry of Isaac Newton (2020) makes several of Newton’s alchemical writings available, allowing access both to diplomatic and normalised versions. Early English Books Online (EEBO 2020) does encode abbreviations, but not their expansions, as does the Helsinki Corpus TEI XML version. In the approach of Mazziotta and Stutzmann discussed in section 6, abbreviations are very much seen as part of the allographic system, and the aim of the ORIFLAMMS (2020) project, which applies it, was to come up with a comprehensive typology for medieval writing systems. Several teams contributed to ORIFLAMMS. The project has published large corpora and made lists of abbreviations available on GitHub.
§57 There are also resources that do not use TEI encoding. The digital edition An Electronic Text Edition of Depositions 1560–1760 (Kytö, Grund, and Walker 2011) uses XML with a high degree of diplomatic accuracy but is not TEI-based. In English studies, there are a number of corpora that encode abbreviations in expanded form, using an ASCII-based system originating in the 1990s. These include the Edinburgh resources Linguistic Atlas of Early Middle English (LAEME) and Linguistic Atlas of Older Scots (LAOS), as well as the Middle English Grammar Corpus (MEG-C) and Middle English Local Documents (MELD) corpora compiled at the University of Stavanger.
§58 A more recent addition is the development of deep-learning algorithms and Handwritten Text Recognition (HTR), used for projects like HIMANIS (2020) and Transkribus. These tools are able to provide plain text querying and indexing of thousands of pages of medieval manuscripts using image data, enabling the application of distant-reading techniques to manuscript data. As they work on images, the algorithms are able to retrieve both abbreviated and unabbreviated strings in different volumes and different handwritings. Thus, developments in HTR and deep learning promise to revolutionise the quantitative study of abbreviations with “big data” in the not-too-distant future. Yet as Stutzmann et al. (2018) note, at the moment, there is still some work to be done before “uneven, automatically generated data” can be used as a reliable research tool. The risk is that automatic processes might generate huge quantities of bad data. Nevertheless, for scholars wanting to study abbreviations quantitatively there are now several resources available and the situation is likely to improve further. With the increasing availability of resources that enable quantitative palaeography, we might be looking towards something of a golden age for this fascinating area of study.