You’re Collating Just Fine and Other Lies You’ve Been Telling Yourself

Although textual scholars agree that collation is a crucial component of the editing process, it often goes undefined and is only briefly explained. This article defines the term, explains different kinds of collation, and explores some of its applications, with an emphasis on stemmatology and medieval textual traditions. Drawing on editorial examples and the theoretical frameworks of projects centred on works such as the Canterbury Tales, Troilus and Criseyde, Dante's Commedia, and the Greek New Testament, the article compares manual and computer-assisted approaches to collation. We delineate the scope of this activity and argue that computer-assisted collation minimizes the risk of missing relevant data. We examine the advantages of full-text collation over sample collation and conclude that no decisions about stemmatically significant variation can be made a priori and that variant distribution is the major factor weighing on significance.


Introduction
Although most theorists and textual scholars refer to collation in one way or another, they do so in passing, as if anyone encountering their texts would understand its meaning unequivocally and no one would ever require further explanation. In this article, we briefly examine the concepts behind the word collation, with a focus on textual collation, stressing the fundamental considerations for the optimization of computer-assisted collation. Because our research interest resides in the investigation of textual filiation, this article emphasizes the stemmatological purposes of our collations, describing and questioning some procedures. To conclude, we restate that editions depend on a limited number of variants and that how researchers identify and select variants through collation affects their understanding of textual relationships and their perception of the text.
In researching this article, we found that a significant number of scholars write the word collation and emphasize the importance of the process without going into further details about the concept. We are not the first ones to point this out. In a piece published in 2017, Elena Spadini states: "If the reasons why we collate are well known, the way we do it, especially when we do it manually, is less documented: handbooks and essays seem to take for granted this delicate task or summarize it in a couple of sentences." (Spadini 2017, 245). Although Spadini is not strictly right in this assertion (see our discussion of David C. Parker's advice for manual collation below), the spirit of what she writes resonates with anyone researching collation theory.
In 2016, Tara Andrews presented, as part of the preliminary DiXiT activities before the conference of the European Society for Textual Scholarship in Antwerp, a paper eventually printed (in the same volume as Spadini's article) under the title "What we talk about when we talk about collation" (Andrews 2017). At four printed pages, this is one of the most substantial considerations of collation that we have been able to find. Andrews relies on the entry for "Collation" in the Lexicon of Scholarly Editing ("Collation" 2013), which partly explains the carnivalesque heterogeneity of her references. The authors of record are as diverse as Grésillon, Hockey, and Plachta. They include Biblical scholars (Colwell and Tune), Victorianists (Shillingsburg), Modernists (Eggert), and Armeniologists (Andrews herself). And yet, as Andrews seems to acknowledge, the mere mention of the word collation does not imply an explicit definition of the concept but rather the implicit understanding that scholars know what it is and how to carry it out. Indeed, more often than not, authors mention the process without pausing to reflect on the meaning of collation in the context of their own work. It is left to others to extrapolate what is being said and what its context is. Like the authors of the Lexicon, we have also found references to collation in the context of textual scholarship (Blecua 1983; Gabler 2007; Parker 2008; Waltz 2013; Trovato and Reeve 2014; Bordalejo et al. 2014; Bordalejo 2014; Driscoll and Pierazzo 2016; Bordalejo 2018; Fischer 2019), the vast majority of which live up to Spadini's description of articles that note collation but never elaborate.

What is collation?
To collate is to compare by close examination. The word, however, has different specialized meanings, even among textual scholars. One can collate books or collate texts; one may speak of horizontal or vertical collation (Williams and Abbott 2009, 92). It is also possible to collate documents. Although these terms are not in any way obscure for specialists, it is reasonable to include them briefly as part of this article for two reasons: first, they could serve as a reference point for the textually curious; and second, they clarify exactly where the emphasis of our argument lies.
Some textual scholars are mainly bibliographers, i.e. their main interest resides in the physical characteristics of the book. Like book historians, they seek to learn about the book as a material object. When bibliographers use the term collation, they are talking about bibliographical description. In the context of analytical and descriptive bibliography, collation refers to the accounting for the quire structure that constitutes the physical form of the book (McKerrow 1927; Greetham 1994; Bowers 1995; Gaskell 2000). This is expressed in a collation formula, "a shorthand note of all the gatherings, individual leaves, and cancels as they occur in the ideal copy" (Gaskell 2000, 328). Bibliographical collation, as fascinating as it is, does not concern us for the purposes of this article.
In textual criticism, collation refers to the systematic comparison of two or more texts, with or without the aid of a base-text. Those texts could have come into being by different means: copied by scribes, printed on a hand press, or reproduced by a mechanical press. There are two types of text collation with very different focuses: vertical and horizontal.
Horizontal collation compares different instances of the same printing, a technique used by bibliographers to compare different copies of printed books, traditionally with optical collators but now also by replicating the optical processes digitally. The objective is to detect differences, such as stop-press variants or resettings, that might shed light on the material history of the printed book. The variants uncovered by this type of collation are not transmissional (as is the case with manuscripts copied by scribes) but revisional, when errors are detected and corrected during the printing process, or remedial, when they result from the resetting of a forme due to accidents during the book production process (Plachta 1995, 504-506).
An example of revisional variation can be found in Mari Agata's research. Agata detected stop-press variants in the Gutenberg Bible, from which she inferred that the paper printing predated the vellum printing (Agata 2006; 2011). This makes sense, as the less expensive material was used as a sort of preliminary state before the more expensive production was undertaken. For her research, Agata used digital methods, which included semi-transparent images in Photoshop and fast animation alternating images in Macromedia Director. The first method emulated what purely optical collators (McLeod's or Hailey's) would do, while the latter produced results similar to the optomechanical alternating lights of the Hinman collator (Hinman 1955).
Vertical collation investigates successive textual stages, usually within manuscript traditions or pre-print documents, instances likely to present a less than clear chronology to guide research.
There are different reasons to carry out vertical collations, which can be related to linguistic or textual matters. This article focuses on vertical collation and considers some of its potential applications with a particular emphasis on collation for stemmatological and textual analysis.
Nearly thirty years of research by the Canterbury Tales Project have sought to achieve a thorough understanding of the textual transmission of the Canterbury Tales. This complex task requires many steps, of which transcription and collation are fundamental.

Defining Variation
In order to compare texts, we have to define what a variant is. According to Vinaver's analysis of the "movements" that constitute the act of copying, scribal variants come to be when any of the following steps is ineffective: "(a) the reading of the text; (b) the passage of the eye from the text to the copy; (c) the writing of the copy; and (d) the passage of the eye from the copy back to the text" (Vinaver 1976, 142; Blecua 1983, 16-17). An initial categorization of the nature of variants resides in the now-classic division between substantives and accidentals, terms W.W. Greg coined in his well-known essay "The Rationale of Copy-Text" (Greg 1950, 21).
According to Greg, substantive variants are those that "affect the author's meaning"; this division responds to the way in which scribes or compositors "may in general be expected to react" (Greg 1950, 21). Greg posits that with substantives one can assume the aim will be to reproduce exactly those of the copy, while scribes or compositors "will normally follow their own habits or inclination" with accidentals (Greg 1950, 22). João A. Hansen, in Volume 5 of his edition of Gregório de Matos e Guerra's poetry, expands on Greg's thoughts and hypothesizes that accidentals are more likely to be changed in the copying process because they are not considered to be as closely related to the author's will as substantives (Hansen and Moreira 2013).
Greg and others, like Ben Salemans, seek to present a definition of variation a priori: they try to establish principles that allow anyone to understand what a variant is and to apply that notion to any text. This process of defining variation a priori results either in a series of vague suggestions or in a prescription that will not be applicable to every textual tradition. In his Building Stemmas with the Computer in a Cladistic, Neo-Lachmannian, Way, Salemans lists the characteristics of non-significant variants: differences in capitalization, orthographic variants, dialectological variants, punctuation, word separation, differences in clause headers, ungrammatical sentences, nonsense readings, evident copy mistakes (Karel de Grote vs Krl de Grote), names, archaisms, frequently used words, synonymous parallelism, and inflectional parallelism (Salemans 2000, 68-70). Paolo Trovato celebrates Salemans' ability to distinguish between variants, which are numerous, polygenetic, and irrelevant, and significant errors, which, according to him, are as a rule few and can derive from previous copies, and are thus useful for the construction of a stemma (Trovato and Reeve 2014, 110).
After analyzing Salemans' list, Trovato considers the following to be significant variants: variation in word order, adherence to rhyming conventions in verse, and the addition or omission of words when they are not small or very common (Trovato and Reeve 2014, 111). O'Donnell makes a similar distinction when he describes potentially significant variants: "[The] apparatus entries include only those forms that might be understood as involving a change in metrical, lexical or syntactic 'significance' from the editorial lemma: i.e. variation involving the substitution of one lexical form, metrical pattern, or syntactic construction for another, or the irreversible destruction of sense, metre, or syntax. Such substitutions include variation between contextually appropriate alternatives... and contextually inappropriate alternatives and nonsense forms that cannot easily be restored to the archetypal form" (O'Donnell [2005] 2018, §7.8). By referring to "potentially" significant variants, O'Donnell suggests that the judgement on significance is made at a later point; otherwise, there would be no need for the qualifier. Trovato instead subscribes to Caterina Brandoli's efforts to describe what a polygenetic variant is, that is, a variant that scribes produced independently rather than inheriting from a common ancestor (Brandoli 2007), what we refer to as agreement by coincidence. Brandoli's results largely agree with Salemans' (Trovato and Reeve 2014, 220). Trovato's conclusions aspire to be comprehensive and applicable to every textual tradition. His is an argument that favours the judgement of variants a priori, that is, the editor decides in advance what is relevant for stemmatological purposes.
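The a priori exclusion of such categories can be made concrete in code. The sketch below is our own hypothetical illustration, not any project's actual pipeline: it normalizes away several of the classes Salemans lists as non-significant (capitalization, punctuation, word separation) so that two readings differing only in those respects compare as identical.

```python
import re

def normalize(reading: str) -> str:
    """Collapse variant categories treated as non-significant a priori:
    capitalization, punctuation, and word separation."""
    reading = reading.lower()                  # neutralize capitalization
    reading = re.sub(r"[^\w\s]", "", reading)  # strip punctuation
    reading = re.sub(r"\s+", "", reading)      # neutralize word separation
    return reading

# Two readings that differ only in "accidentals" (in Greg's sense):
a = "Whan that Aprill, with his shoures soote"
b = "whan that aprill with his shoures soote"
print(normalize(a) == normalize(b))  # the difference disappears
```

The danger such a filter illustrates is exactly the one discussed above: normalization of this kind also erases dialectal or orthographic evidence that a particular tradition might need, which is why a single a priori list cannot serve every textual tradition.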
In her doctoral dissertation, Bordalejo outlines the criteria she used for collating the texts of Caxton's printed editions of the Canterbury Tales: "I have considered as significant all additions, deletions and substitutions, all the changes in word-order, all substantive variants [as opposed to Greg's accidental variants]" (Bordalejo 2002, 104). She agrees with some of Salemans' categories, but at no point does Bordalejo indicate that these criteria are applicable to any other textual tradition, or even to other aspects of the study of the Canterbury Tales. It is more productive to describe the type of variation taken into account during the collation process, and how the results of the collation were employed in the creation of the apparatus, than to attempt a definitive list of genealogically relevant variation for every textual tradition. Vázquez, in his article "Transcribing and Collating for Digital Stemmatology" (Vázquez Forthcoming), shows how some of the items on Salemans' list can be questioned. Furthermore, Bordalejo's approach aims to postpone the judgment of variants, since advances in stemmatics and the use of digital tools do not require the a priori judgment of readings that would likely belong to the archetype (Bordalejo 2002, 98). This emphasis on "potential" stemmatically significant variation is similar in approach, if not in substance, to what O'Donnell proposes. Although the judgment of variants is postponed, it must be emphasized that it is not abandoned: only after, and through the lens of, the analysis of all the variants is it possible to make legitimate claims about the genealogy of the witnesses of a textual tradition (Bordalejo 2002, 99).
Hence, the difference between prescriptive and descriptive stemmatology: the distinction between assuming that one knows what should be taken into account versus allowing the textual evidence to speak for itself while also being aware of the objectives of an individual project. In the case of the Canterbury Tales Project, research concentrates on stemmatically significant variation that can (potentially) contain genetic information.

The purposes of collation
In the context of textual criticism focused on medieval materials, collation can have one or more end goals. Scholars might be concerned with the range of variation present in a series of texts; they might want to achieve a better understanding of the relationships between different instances of a text; they might want to explain the historical or linguistic circumstances surrounding a textual tradition; they might want to understand the development of a text; or they might be seeking to isolate readings to include in their editions. These are just a few examples of what can be achieved by collating texts. Each scholar can choose to focus on some aspects more than others, or use multiple simultaneous approaches. Although Andrews states that "[t]he comparison may be done at the word level, at the character level, or at another unspecified syntactic or semantic level, according to the sensibilities of the editor" (Andrews 2017, 232), we maintain that the choice has little to do with "sensibility" and everything to do with knowledge and the nature of the extant primary documents.

a) Apparatus
A common goal of collation is the production of a critical apparatus for inclusion in an edition.
The type of edition (historico-critical, genetic, reader's edition) is not as relevant as the dialectical relationship between apparatus and text. Marina Buzzoni makes this distinction when she states: "The apparatus is indeed different from the descriptive lectio variorum one can get by, say, applying any collation software to the transcription of the witnesses. The apparatus is critical -i.e. interpretative- in that it accommodates certain variant readings, and excludes some others, according to the editorial principles to which the philologist conforms" (Buzzoni 2016, 76). She goes on to explain that while stemmatologists might only record those variants that preserve genetic information (excluding singletons, for example), anyone interested in linguistic features or language evolution might include the lectiones singulares as a record of a philologically significant moment (Buzzoni 2016, 76).
This distinction was particularly pertinent in light of the discussions of a possible change of name for the TEI Critical Apparatus group. Suggestions like "textual variance" or "textual variants" do not carry the same weight or implications as "critical apparatus." The relationship of the latter with the text it informs differs from the mere output of a collation process.

b) Stemmatology
For stemmatological purposes, transcription and collation form a solid base from which scholars can investigate the relationships between different witnesses within a textual tradition. To research textual filiation, scholars must first decide which variants are going to be deemed stemmatologically significant. And yet, every editorial decision made before that point affects the potential results.
A thorough understanding of matters related to the textual tradition builds a trustworthy foundation from which collation decisions can be made. A dependable collation should, in turn, provide the basis for further research. This can be carried out by hand or, as in the case of the Canterbury Tales Project, with a combination of phylogenetic analysis and advanced database searches (see Bordalejo's article on the analysis of textual materials using these techniques). Below, we refer to various projects using collation for stemmatological purposes: the Commedia, the Canterbury Tales, Troilus and Criseyde, and the Greek New Testament.
A collation for stemmatological purposes requires, at least in the case of medieval texts, a process of regularization and alignment. See below the section on Chaucer's Canterbury Tales.

c) Corpus Linguistics
Linguistics projects clearly fall outside the scope of textual criticism. However, there are some parallels regarding the treatment of words, especially the processes of regularization and lemmatization, since each groups different items under disciplinary criteria to improve the usability of the data. The Diachronic and Diatopic Corpus of American Spanish (Corpus Diacrónico y Diatópico del Español de América, henceforth CORDIAM) is a project recently supported by the Association of Spanish Language Academies, hosted by the Mexican Academy of the Language, and made possible by the collaboration and direction of the Mexican linguist Concepción Company and the Uruguayan linguist Virginia Bertolotti. The aims of this project are: a) to historicize the development of American Spanish, b) to accomplish a historical dialectology of American Spanish, and c) to achieve a complete and rich study of the history of American Spanish, "without geographical or dialectological parcellations when these are not required, given that to speak and write in Spanish is an integral reality, common to hundreds of millions of Hispanophones" (Company and Bertolotti 2018, 78). It is unfortunate that a clarification is needed here: in this text, "American Spanish" does not mean 'the Spanish of the United States of America'; it means 'the Spanish of the American continent.' CORDIAM's interface asks the user to enter a word so that it can perform a search through the 12,907 texts and 9,644,566 words that the corpus contained as of May 21, 2020. A crucial feature of CORDIAM is that 70% of the corpus is lemmatized (Company and Bertolotti 2018, 100), which allows for complex searches. A lemma is "the technical term in lexicography and linguistics for a lexical item as it is presented in a dictionary entry, for the sake of clarity and economy" (Butterfield 2015). Thus, to lemmatize "is to group together varying words or forms of words: e.g. in work on a concordance" (Matthews 2014).
One can search for the Spanish verb ir (to go) plus a preposition that indicates direction and another infinitive (infinitives in Spanish end in -ar, -er, -ir), the equivalent of the English construction "to go to + infinitive." The results include conjugated forms of the verb ir together with the rest of the phrase. Since the verb ir is irregular, this is only possible because of the lemmatization process. Lemmatization is somewhat reminiscent of regularization in the collation process: while the purpose is different, the idea in both cases is to group words together for analysis. As in a computer-assisted collation, words undergo a tokenization process in which "plain text can be split into words at whitespace" (Nury 2018, 77), to be further treated in accordance with the project's goals. In linguistics, a lemmatized corpus like CORDIAM can be used for the historical study of grammar and syntax, for lexicology, and for dialectology, among other purposes; the fact that one can survey complex constructions that bring together the different conjugations of verbs is proof of this. For the textual scholar, the regularization and alignment that follow tokenization make it possible to compare the units judged to reflect variation, in order to differentiate between different states of a text due to authorial intervention or transmission.
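A toy version of such a lemmatized search might look as follows. The miniature lemma table and example sentences are invented for illustration and bear no relation to CORDIAM's actual data or interface; a real corpus would rely on a full morphological lexicon rather than a hand-built dictionary.

```python
# Hand-built lemma table: some inflected forms of the irregular verb "ir"
# mapped to their lemma (illustrative only; far from exhaustive).
LEMMAS = {"voy": "ir", "vas": "ir", "va": "ir", "vamos": "ir",
          "fue": "ir", "iba": "ir", "iré": "ir", "ir": "ir"}

def lemma(token: str) -> str:
    return LEMMAS.get(token.lower(), token.lower())

def find_ir_a_infinitive(sentence: str):
    """Find 'ir + a + infinitive' regardless of how 'ir' is conjugated."""
    tokens = sentence.split()  # tokenization: split at whitespace
    hits = []
    for i in range(len(tokens) - 2):
        if (lemma(tokens[i]) == "ir" and tokens[i + 1] == "a"
                and tokens[i + 2].endswith(("ar", "er", "ir"))):
            hits.append(" ".join(tokens[i:i + 3]))
    return hits

print(find_ir_a_infinitive("mañana voy a comprar pan"))  # ['voy a comprar']
print(find_ir_a_infinitive("ella fue a ver la casa"))    # ['fue a ver']
```

The point of the sketch is the one made above: without the lemma table, a surface search for "ir a" would miss the conjugated forms voy, fue, iba, and so on, just as an unregularized collation would miss alignments between spelling variants.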

d) Popular lyrics
Margit Frenk's Nuevo Corpus de la antigua lírica popular hispánica is, as the title indicates, a corpus of traditional Hispanic lyrics from the 15th to the 17th century. The purpose of the corpus is to register traditional lyrics. Given the traditional nature of these poetic compositions, there is no single author. As the compiler explains in her book Entre la voz y el silencio: "The collectivity possesses a poetic-musical tradition, a limited stock of melodic and rhythmic resources, of literary motifs and topics, of metrical and stylistic modes (a limited stock of resources does not mean a small or reduced one). The author must move within these limited resources so that his composition is accepted, as must the innumerable individuals who will retouch and transform it as time goes by. A very small margin is left for originality and innovation, although these are not excluded" (Frenk 1971, 11).
In this quotation, the focus is not on the authorial figure, the uniqueness of his or her expression, or the intention of his or her artistic endeavour. Since this is a corpus of traditional lyrics, every intervention is equally important: the text belongs to the community. The nature of these texts guides the editorial criteria.
Frenk states that this work is "something like a critical edition" (Frenk 2003, 23). The main concerns are the "fidelity to the original sources, the scrupulous attestation of variants, and the coherence of the editing criteria" (Frenk 2003, 23). For Frenk, a version of a lyric is "every occurrence of a lyric, either in different sources or in the same source, whether it presents variation with regard to the 'base text' or is identical to it" (Frenk 2003, 24 n23). The focus is on the recurrence of the lyric, not on originality. Frenk discusses the selection of a base text by stating that, "given that this is an oral tradition, there is no version that can be considered either the first or the best -initially, they all carry the same value-" (Frenk 2003, 25-26). Therefore, the criteria for selecting a base text have to do with commonality: "thus, the preferred course has been to choose the version that has the most features in common with the others, so that it reflects what may have been the preponderant version through the xvi and xvii centuries, as well as to reduce the apparatus" (Frenk 2003, 26). Frenk collates the text of the different versions to meticulously register the variants in the apparatus, but not to reconstruct an original, because that notion is alien to this literature. Aside from registering the variants in the apparatus so that the reader can reconstruct the text of the witnesses, Frenk explains that the apparatus has a chronological direction, so that it is possible "to build an idea of the temporal trajectory of the changes" (Frenk 2003, 26).
The corpus registers more than 2000 traditional lyrics in a two-volume edition of around 1000 pages per volume. Needless to say, the physical format makes it a difficult corpus to work with because of its size, and the reader also needs to become acquainted with the relationship between the base text and the apparatus. If this project had started in an era when digital publication was available, it is possible that the Nuevo Corpus would have the format and appearance of a digital variorum edition. In essence, it is a variorum, since it attempts to represent as many versions of a text as possible to enable the reader to navigate the dynamics and results of traditional lyrics. It is a celebration of difference, rather than a judgement of what belongs in the critical text and what must be relegated to the apparatus. Frenk uses the tools of textual criticism to present lyrics in a universe where the recurrence of the poems and chronological order are more important than authors and originality. The purpose and the nature of the texts guide how and why to collate.

Different Approaches to Collation
The importance of manual collation methods resides in the fact that, conceptually, they also form a natural basis for our understanding of computer collation. For example, the use of a base-text as a referencing system for recording divergent variation underlies some types of automatic collation, while the pairwise comparison of texts forms the basis of other systems. Manual collation is an error-prone activity. Blecua suggested that collation should be done by more than one individual since "aside from the slowness of the process, errors caused by jumping from the reading of one of the witnesses to another are numerous" (Blecua 1983, 45). Full-text collations based on full-text transcriptions minimize the risk of ignoring relevant evidence. We examine a few examples of manual collation to show how they relate to computer-assisted methods and how they compare to them.
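In computational terms, base-text referencing can be sketched as a standard sequence alignment over word tokens. The following minimal example is our own illustration, using Python's standard-library difflib rather than any dedicated collation tool (such as CollateX); the witness siglum and sample readings are invented.

```python
from difflib import SequenceMatcher

def collate(base: str, witness: str, siglum: str):
    """Record divergences of one witness against a base-text,
    keyed to word positions in the base."""
    base_toks, wit_toks = base.split(), witness.split()
    variants = []
    sm = SequenceMatcher(a=base_toks, b=wit_toks, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # keep only places where the witness diverges
            variants.append((i1, i2,
                             " ".join(base_toks[i1:i2]) or "(om.)",
                             " ".join(wit_toks[j1:j2]) or "(om.)",
                             siglum))
    return variants

base = "a knyght ther was and that a worthy man"
for record in collate(base, "a knight ther was and that a worthi man", "Hg"):
    print(record)
```

Because every variant is keyed to a position in the base, records from many witnesses can be merged into a single apparatus; this is the referencing function of the base-text described above, independent of any editorial privilege granted to its readings.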
We also compare sampling with full-text collation. The general procedure for sampling collation is to select a few witnesses that are regarded as significant, from which a full-text collation could perhaps follow. Nury argues that "[i]n practice, when dealing with a large manuscript tradition, only a few witnesses will be fully collated, i.e., the ones selected as the most relevant to edit the text" (Nury 2018, 68).

a. Manual collation
Parker states that there might be cases in which a scholar is circumstantially forced to carry out manual collation, and, for that reason, he suggests a series of considerations that should be helpful to anyone caught in a computerless situation (Parker 2008, 95-100). Although Parker focuses on the Greek New Testament, and so some of his references are only relevant for the edition of that text, most of the techniques described should be useful for anyone attempting to produce hand-written collations of manuscripts. Parker makes an important distinction between collation and transcription: "The general rule is that you are recording data that will be of value in establishing the relationship between the manuscripts, not all the data about the manuscript you are collating. If you wanted to do the latter, you would make a transcription" (Parker 2008, 97). This separation of transcription and collation, so clearly drawn for the purposes of a hand-written collation, is also essential for our understanding of how a digital collation is carried out. In Parker's view, a collation of the Greek New Testament should not include punctuation, accentuation, itacisms, spelling variations, or abbreviations (Parker 2008, 97). Although it might appear as if Parker were being prescriptive, the reality is that he has been working on the manuscripts of the Greek New Testament since he began his career, and with computer-assisted collation for some 25 years.
Parker is describing the collations produced within his John 18 project: in effect, he is describing how his digital transcripts are regularized when processed with collation software, translating those procedures into a possible manual collation.
b. Sampling vs full-text collation. Loci and deteriores.
In order to decide what to collate, long before digital collation, there were two routes, which can be illustrated with Barbi's loci and Petrocchi's edition: limiting the places of variation subjected to analysis, or limiting the number of witnesses on extratextual criteria. According to Alberto Blecua, when there is no study of the textual tradition, or the previous analyses are not trustworthy, test passages should be collated in order to choose a text to serve as the base for collation. He adds that when the textual tradition is abundant, and a complete collation is close to impossible, the loci critici method can be applied. This approach examines passages that have been considered problematic. Blecua further states that the purpose of this method is to eliminate codices descripti or deteriores, but he warns that although the method is fast, it can be dangerous (Blecua 1983, 44). Trovato, discussing Barbi's selection of loci for the Commedia, writes: "Barbi was fully aware that this was not the only necessary collation; it was merely a preliminary operation, but one that would be sufficient for picking out from the vastness of the tradition, on the one hand, all mss. belonging to the most numerous families, on the other, the more promising isolated witnesses -independent of or predating the formation of the vulgate texts- which could then be subjected to a more 'accurate examination'" (Trovato and Reeve 2014, 305). He refers to Barbi, who stated that nothing certain could be concluded after a partial examination, given the "oscillation" he saw (Trovato and Reeve 2014, 305). Doubts about the accuracy of Barbi's loci are not new. Robinson reports that Petrocchi "warned explicitly against the use of the Barbi loci as a base for the textual reconstruction of the whole tradition" (Robinson 2012, 24) because any partial collation can only offer "a preliminary and general sense of direction within the thicket of relationships among the manuscripts" (Robinson 2012, 25), something Barbi himself had acknowledged.
Sebastiano Timpanaro argued against collazioni saltuarie (sporadic collations) in favour of systematic collation (Timpanaro 1985, 6, 11, 30-31). He considered it a mistake to collate only predetermined places of variation, holding that the collation of complete texts would yield more solid results.
Greetham describes the process as a "trial collation" (Greetham 1994, 365) used to establish an initial base-text. Such a process entails a collation in two stages: first, to establish the base-text; second, to establish filiation. We refer to this as sampling collation, in recognition that many scholars do this without the intention of identifying a base text or of ever moving towards a full-text collation.
Giorgio Petrocchi offered a different solution. He did not aim to provide the authorial text of the Commedia; his edition, La commedia secondo l'antica vulgata, was intended to represent "the standard edition of the 1330s" (Cherchi 2008, 413; Vergani 1969, 192). Boccaccio's intervention in the textual tradition of the Commedia produced "untrustworthy editions of the poem around the middle of the Trecento" (Cherchi 2008, 413). Boccaccio created his own recensions, borrowing variants from different manuscripts and performing emendatio ope ingenii when he deemed it preferable (Petrocchi 1966, 9). For Petrocchi, "Boccaccio contaminates more than he corrects" (Petrocchi 1966, 45). Federico Sanguineti took yet another route: he collated the Barbi loci across the whole tradition (something no scholar had been able to achieve in over a hundred years), and from analysis of the readings at these loci he created a comprehensive account of the whole tradition and isolated just seven manuscripts as necessary and sufficient for the creation of a critical text (Robinson 2012, 5). These witnesses were later called the Sanguineti seven. The edition Shaw and Robinson created "became a test of Sanguineti's arguments about the relationships among these seven manuscripts" (Robinson 2012, 6). Sanguineti had singled out one of these witnesses, Urb, as uniquely authoritative; Shaw and Robinson's analysis, however, indicates that Urb shares a common ancestor with another witness, Rb. This finding renders Sanguineti's editorial work unfruitful. The reason for the discrepancy is that a high proportion of the variants considered by Shaw and Robinson as demonstrative of genealogical relevance "would not satisfy Barbi's criteria" (Robinson 2012, 28). Yet it is not the a priori appearance of these variants that supports their argument, but the consistency of the agreements between these witnesses, which reveals a pattern, or, as Petrocchi would put it and Shaw quotes, the "foltezza di statistica" (Robinson 2012, 29). Robinson presents five readings that are "likely to have been introduced by the common ancestor of Urb/Rb; none of these five lines appear among the Barbi loci" (Robinson 2012, 27). Brandoli, for her part, examined how many of Barbi's loci appear to be monogenetic (Brandoli 2007, 113).
As for Petrocchi's key passages, Brandoli states that 282 out of 477 seem to be monogenetic, which is still a majority but not as decisive as Barbi's. She then points out that, of those 282, 132 had already been considered by Barbi (Brandoli 2007, 122), which would seem to prove that Petrocchi was only able to find 150 truly relevant readings and had a tendency to take into account polygenetic variants that are not of interest and would compromise his results.
Our own work on the Canterbury Tales Project relies on full-text transcription. The project's transcription guidelines have evolved over the years, culminating in their current version (Bordalejo and Robinson 2018). In this way, the project produces rich and detailed transcriptions that record places of variation within each document (with the use of the apparatus element) and which can be published alongside the images, thus allowing readers access to the same resources we use in our analysis.
Our work is designed for generosity. We want others to benefit from the many hours spent on the creation of individual transcripts, which is why we make them available for reuse. Since a significant portion of our research was funded by various governmental sources, we have a duty to make the transcripts available for others to use in their own research. Other scholars might be interested in analyzing idiosyncratic peculiarities or quirky spellings. Someone interested in toponymy, for example, could encode all the place names and repurpose our transcriptions to build maps; or they could encode historical figures or characters for a different type of study.
For our own purposes, if our initial transcription fails to record variation, it can easily be modified and processed again. Because we have complete transcriptions, we can choose to publish editions of individual witnesses (Stubbs 2000; Bordalejo 2003) or of multiple ones (Chaucer 2004; 2006). We can rebuild our transcriptions for different purposes and select which version of the text will be displayed, and in what format (see Bittner and Dase's description of the multiple encoding of our apparatus tag).
If there is a downside to full-text transcription, it is the trap of the illusion of control. The transcription of complete witnesses postpones the detection of variants, and the researcher might feel some relief as she waits for the eventual results of the computer-assisted collation. There is safety in not having to jump into a vast sea of variation, with decisions made at every moment.
As we collate the Canterbury Tales, we carry out two further processes: regularization and alignment. The aim of these processes is twofold. On the one hand, we seek to flatten spelling information in order to analyze the collation results with the help of evolutionary biology software. As we have pointed out before, the project seeks to conduct a genealogical analysis of the textual tradition, and spelling variation obscures the filiation of the witnesses. On the other hand, we intend to produce a readable apparatus that can be easily understood by human readers. In the same way that the project's transcription guidelines have evolved, so have our regularization guidelines. The original guidelines for regularization indicated that the different witnesses should be regularized to a very lightly edited version of Hengwrt (National Library of Wales, Peniarth 392 D), likely the oldest manuscript of the Tales and witness to one of the best texts in the tradition. But the words "lightly edited" mean nothing unless further specified: by lightly edited, we meant that all medieval characters no longer in use in contemporary English were substituted with modern forms, abbreviations were expanded, and all lines found in any witness but not present in our base-text were added. This created our base-text for collation, which Collate used for comparison purposes. Later, as the edition was built, the base-text for collation was discarded.
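The regularization step described above can be sketched in a few lines of code. This is a minimal illustration, not the project's software: the mapping table, the word forms, and the function name are invented for the example.

```python
# Minimal sketch of regularization prior to collation (hypothetical
# spellings and mapping; not the Canterbury Tales Project's tables).
REGULARIZED = {
    "yonge": "yong",   # regularize to a preferred base-text spelling
    "sonne": "sonne",
    "soone": "sonne",
}

def regularize(tokens):
    """Replace each token with its preferred spelling, if one is listed."""
    return [REGULARIZED.get(t.lower(), t.lower()) for t in tokens]

# Witnesses differing only in spelling collapse to the same sequence,
# so the collation reports no variant at these words.
w1 = "the yonge sonne".split()
w2 = "the Yonge soone".split()
assert regularize(w1) == regularize(w2)
```

The same principle scales up: the flattened sequences, rather than the raw spellings, are what the collation software compares.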
Although there would have been little interest in the base-text itself, it was useful because it recorded most of the spellings that we would use for regularization purposes. Our rule was to regularize to the most common spelling of Hengwrt, and we had a list of preferred witnesses in case this turned out not to be possible (when Hengwrt did not include a particular term, for example). Elsewhere we explain the concept that underlies our view on variation: It is our core conviction, based on decades of work with digital tools, that "significant variants" are defined entirely by how the variants are distributed across the whole tradition. That is: if we find a number of variants which are present, over and over again, in the same distinctive pattern of witnesses, then those variants are significant. (Bordalejo and Robinson 2019, 37) For the purposes of the Canterbury Tales Project, whether variants are polygenetic or monogenetic is not crucial. Instead, we consider variant distribution, which we can only know a posteriori, the essential factor in determining stemmatically significant variation.
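This a posteriori logic can be illustrated with a small sketch: count how often the same set of witnesses agrees on a reading, and treat recurring agreement patterns as candidates for stemmatic significance. The sigla and data below are invented for the example.

```python
from collections import Counter

# Hypothetical data: for each place of variation, the set of witnesses
# sharing a reading against the rest of the tradition.
variants = {
    "1.12": frozenset({"A", "B", "C"}),
    "2.07": frozenset({"A", "B", "C"}),
    "3.15": frozenset({"A", "B", "C"}),
    "4.02": frozenset({"D"}),
}

# A pattern that recurs across many places of variation suggests a
# genealogical grouping; an isolated agreement tells us little.
pattern_counts = Counter(variants.values())
recurring = {p: n for p, n in pattern_counts.items() if n > 1}
```

Here the A/B/C agreement recurs three times and is worth investigating, while the lone D reading is not. Crucially, nothing in this procedure can be decided before the whole body of variants has been gathered.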

e. Chaucer's Troilus and Criseyde
There is no stemma for the textual tradition of Troilus and Criseyde: previous editions and analyses (Chaucer 1926; 1984; Root 1926; Hanna 1996) have not produced one.
This method postpones judgment. The project does not seek to make general claims about the textual tradition based on the analysis of 1350 lines; the aim is to analyze the selected excerpts rigorously. The full transcription of the excerpts and their full collation enable the textual critic to become acquainted with the material, while the semi-automatic collation aids the collator and reduces the risk of error. One of the advantages of semi-automatic full-text collation is that even if the collator misses one or two variants, the direction of the collation is less susceptible to change, since it relies on the complete body of evidence. The stemmatological aspect of the project does not rest on the selection of a few variants. As seen with the discrepancies between Barbi and Petrocchi, or Shaw and Sanguineti, variants that would escape the attention of the manual collator and alter the direction of the genealogy are less likely to be missed by a textual critic who uses digital tools. The Troilus project is far from complete, but it has shown consistent results so far (Vázquez 2020).
By comparing the critical apparatuses that Barry Windeatt and Robert K. Root provide in their editions with a computer-assisted collation, we can see that there is a small omission as early as line four. The line reads "fro wo to wele, and after out of ioie" (Chaucer 1984, 84). All the witnesses but one agree on "out of"; Rawlinson reads "on to". Neither apparatus registers this variant. It is a small variant, and it is not relevant for genealogical purposes, since it is not present in any other witness; yet one may argue that, since both editions refrain from making genealogical statements, the apparatus is there to inform the reader of what is present in the textual tradition. On another occasion, Windeatt's apparatus draws attention to the fact that the text in the Corpus manuscript presents a peculiar spelling in the third line: "auentures] auentuirs Cp (3 minims after t)" (Chaucer 1984, 85). Thus, a lack of interest in small details is not to blame for Windeatt's disregard of the Rawlinson variant: the most likely explanation for this omission is that it is easy to miss when collating manually. A digital collation tool that draws variants from full-text transcriptions cannot fail to bring this reading to the collator's attention.
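To see how little machinery this requires, the comparison can be reproduced with the Python standard library alone. This is our illustration, not the software used by either editor, and punctuation is omitted for simplicity.

```python
from difflib import SequenceMatcher

# Word-level comparison of line 4 against the Rawlinson reading.
base = "fro wo to wele and after out of ioie".split()
rawl = "fro wo to wele and after on to ioie".split()

sm = SequenceMatcher(a=base, b=rawl)
variants = [(base[i1:i2], rawl[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]
# Even this basic aligner isolates the "out of" / "on to" variant.
```

A dedicated collation tool does far more (regularization, multi-witness alignment, apparatus output), but the point stands: once full-text transcriptions exist, the machine cannot overlook a variant the way a tired human eye can.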
f. The Greek New Testament, Nestle-Aland, and the Editio Critica Maior
The Institute for New Testament Textual Research has collected the necessary materials of the textual tradition of the Greek New Testament. It was then necessary to filter the material so that "new views about important manuscripts find their way into the minor editions of the institute, and finally to present an Editio Critica Maior" (Mink 2004, 17). In order to assess the relevance of the text of the manuscripts, a sampling method was devised and the results published "in the five volumes of the Text und Textwert der Griechischen Handschriften des Neuen Testaments" (Morrill 2012, 7). It was meant to separate the manuscripts that contain "the relatively uniform text which was standard at the end of the Byzantine tradition from the still large number of manuscripts which must be considered relevant on account of their deviations from the majority text" (Mink 2004, 17). The results made it possible to select the manuscripts that do not contain the uniform text from the end of the tradition.
The Claremont Profile Method, used for the classification of 1385 manuscripts of the Gospel According to St. Luke (Morrill 2012, 23), is another example of sampling collation. The test passages were picked out after "the complete collation of 282 manuscripts. A sample of three chapters, 1, 10, and 20, were selected, and all variation in these chapters was evaluated" (Morrill 2012, 21). After careful consideration, 196 test passages were used to analyze 816 more manuscripts, "for a total of 1385 manuscripts in 196 passages" (Morrill 2012, 23), and fourteen groups of witnesses were created after the analysis. The purpose of this collation was not to create an apparatus, nor an edition. The reasoning was that the "critical apparatus would adequately represent the manuscript tradition if it included representatives of the groups, plus those manuscripts that did not fall into definable groups" (Morrill 2012, 19). It cannot be denied that the Claremont Profile Method achieved remarkable results, yet it will always be preferable to rely on full-text transcriptions and full-text collations.
The Institute for New Testament Textual Research has used computer-assisted collation tools for more than 20 years. First Collate and now CollateX have played a significant role in the development of their editions. Both the Nestle-Aland Greek New Testament and the Editio Critica Maior built their research and apparatuses with the help of semi-automated software after carrying out complete transcriptions of witnesses (Houghton et al. 2020).
Bordalejo has previously described the Editio Critica Maior as a born-digital printed edition, in reference to the techniques used in its construction and as an example of how similar digital textual-critical tools are to their analogue counterparts, despite their speed and higher accuracy (Bordalejo 2013, 65n). What is remarkable about this edition is how its apparatus, based on a computer-assisted collation, differs from that of the Nestle-Aland edition. Despite the same tools being employed by both, the distinct purposes of the editions are made evident in the apparatuses generated from very different collations.

Digital Collation Tools
There are several tools that can be used in order to conduct a computer-assisted collation. In this section, we examine some of the tools with collation features, their characteristics, and their potential uses.
i. TUSTEP
TUSTEP (TUebingen System of TExt processing Programs) is a toolbox for the scholarly processing of textual data. According to Gabler, its "algorithm was devised 30 years ago and is still among the most powerful and efficient of collation" tools (Gabler 2007, 4). A demonstration of TXSTEP in 2015 shows the algorithm at work. It displays versions of the text, but it appears that one has to program subroutines so that the collation shows significant variants instead of different spellings (such as the German long s) or punctuation marks.

ii. Juxta
While it may be easier to use and its interface friendlier, Juxta will be handy for people who are interested in collating texts with unified orthographies. Juxta, as their example shows, is an efficient way to show differences between documents. It highlights the places of variation and differentiates degrees of variation by using different hues of blue. It is not possible to see more than two witnesses in the side-by-side view. The TEI tagging process took a long time, and even though these are eight witnesses, it should also be said that this is just one stanza; it would be next to impossible to use this feature with the 18 witnesses or with more than seven lines. But the main problem is that, for a textual scholar who studies texts without unified spelling systems, this tool is almost unusable. It is good at showing difference, which is not the same as variation, or at least not the variation that might interest a textual scholar. In other words, although the alignment is accurate, variation cannot be properly displayed because there is no option for regularization. Differences in orthography are important, especially if they are indicative of dialectological information, or if one is interested in studying particular manifestations of different usus scribendi, but Juxta is not useful for the creation of a critical edition that researches the genealogy of the witnesses.

iii. Collate, CollateX, and the Collation Editor
Collate (Robinson, n.d.) is specialized software developed by Peter Robinson. The experience of using Collate was a window into the mind of its developer and evinced Robinson's understanding of textual criticism and scholarly editing.
Collate is complex because the tasks editors require the software to perform are complex.
Preparing the files for use with Collate was a necessary prerequisite if one intended to get the best results and, although the software handled ASCII files, the text was processed more precisely if light encoding was included to mark textual sections (Robinson 1994, 32).

Conclusions: Learning from Collation
Collation is not an isolated process that happens in a vacuum, but a practice occurring within the wider context of textual-critical research and which, at times, leads to the production of an edition. It is the process of variant identification that allows scholars to further their research agenda. However, editions depend on a limited number of variants. Even in the case of digital editions, where every piece of variation can be included, we are forced to reckon with the reality represented by our limited number of texts. Human intellect is also limited, particularly when dealing with a large number of items, and for this reason we rely on other systems to help us process the vast number of variants detected during a regular collation process.
We have shown that, although the collation process relies on the same principles whether it is carried out manually or with the aid of computers, there are significant advantages in using computer-assisted methods over manual ones. Although the preparation of files for computer collation requires a significant investment of time and effort, by creating full-text transcriptions and making them publicly available, we ensure that our work can be evaluated by other scholars and reused in future research. These advantages are more evident in the context of research on large textual traditions, but they do not disappear for shorter or less widely disseminated texts.
The success of a critical edition relies on its ability to connect a system of data. With computer-assisted collation methods and full-text transcriptions, the process that leads to a critical text becomes comprehensive, thorough, and more transparent to the reader. As a consequence, the critical text turns into a window through which we can observe the circumstances and the interventions of many of the agents that made it possible for us to engage with the texts that weave us into a community.
A great deal of attention was required to detect those variants with the potential to be stemmatically significant. In the previous example, such variants are represented by "And the/The" in line 2 and "licoue/lycour" in line 3. Bordalejo read all the variants between the editions, attempting to isolate those with potential stemmatic significance. Although the aim was to record the majority of the variants, and great care was devoted to the data gathering, human error remained a consideration. However, the proportion of the potential differences between the electronic and the manual data gathering was not clear until the regularization process had been completed. These errors can be further divided into errors encouraged by the display in Collate, which was not optimized to be read but rather to be processed, and simple inaccuracy or oversight.
1. Transcription mistakes could have occurred at any point. These could have been introduced after Bordalejo's final collation but before the Miller's Tale on CD-ROM was finished, or they could have been corrected for publication, in which case the mistake would have remained part of the final collation.
2. The Canterbury Tales Project uses parallel segmentation, which ensures that variant phrases are reduced to their minimum variant expression (sometimes two words, but in a few cases six or seven). However, Collate simply shows sequences of differences. When evaluating the Cx1/Cx2 collation, variant phrases were counted as a single variant (even in cases in which they could have been separated into more than one).
3. Because the regularization process is carried out by humans, it is prone to error. Occasionally, there has been overregularization.
This gives a total of 14 variants which we need to subtract to obtain 103 net variants, of which 79 had been correctly isolated. This translates into 24 variants which were not detected in the original Caxton collation when reading the Collate output. Of those 24 overlooked variants, 3 are related to the Collate display and 2 (MI 208 and MI 210) were errors in judgment due to lack of experience.
The other 19, even though present in the original collation between the Caxton editions, were not detected. Four of those occurred in lines that show some other kind of variation, and their closeness to other variants might explain why they were overlooked. This leaves a total of fifteen missed variants for which we cannot find any explanation. Those fifteen variants translate into around 20% of the variation. Projecting such a number onto the rest of the text would yield a total of some 600 undetected variants, if one were to carry out a manual collation with a steady error rate.
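The arithmetic behind these figures can be laid out explicitly. The total of roughly 3000 variants used in the projection is our assumption, inferred back from the stated result of some 600; the other numbers come from the text above.

```python
# Back-of-the-envelope check of the error figures (the 3000-variant
# total for the whole text is an assumption for illustration).
net_variants = 103
correctly_isolated = 79
overlooked = net_variants - correctly_isolated               # 24 missed
display_related, judgment_errors = 3, 2
undetected = overlooked - display_related - judgment_errors  # 19
near_other_variation = 4
unexplained = undetected - near_other_variation              # 15 with no explanation
miss_rate = 0.20                # "around 20% of the variation"
projected_misses = miss_rate * 3000   # some 600 for the full text
```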
Even with the risks and decisions involved in the regularization process, the accuracy rate of machine-assisted collation is much higher. Since its publication in 2004, only three regularization mistakes have been found in the Miller's Tale. Leaving aside disagreements with the editor and parallel segmentation matters, the machine-aided collation has an inaccuracy rate of only 2%, a much better ratio than the one achieved by Bordalejo.
If we assume that these numbers are representative, the case for computer-assisted collation is solid. Although it is possible to produce more accurate manual collations when researchers can dedicate sufficient time and effort to their work, it seems evident that the computer's capacity to count and classify should produce better results within a similar timeframe, provided that the data is relatively free of errors. Although computer-assisted collation is time-consuming in its data preparation, it is the superior system when limited time is available.