§ 1 As more academic projects produce electronic resources, they inevitably involve conversion from analogue sources such as manuscript and print. Moreover, as electronic resources age, they too need maintenance and conversion in order to be properly preserved. This article looks at conversions of three different types of data from three distinct projects: CURSUS, the Records of Early English Drama (REED), and the Oxford Text Archive (OTA). The CURSUS Project created dynamically assembled electronic texts by transcribing from medieval manuscripts and retrieving any critical apparatus from a centralised repository. In producing structurally marked-up texts from the legacy data of the REED Project, I explored some of the possible methods for converting print resources, in this case starting with the fortunate survival of the Quark typesetting files used to print the original texts. Finally, the Oxford Text Archive undertook an examination of a subset of its legacy data to provide information on the possibilities and problems of migrating many of these resources to more modern formats. As part of this project, I developed a sample conversion method for COCOA verse drama.
The CURSUS project
§ 2 The CURSUS project (http://www.cursus.uea.ac.uk) received a fixed three years of funding from the Arts and Humanities Research Board (now Council) to use new technologies in the production of an online resource of medieval Benedictine liturgical manuscripts. The project was directed by Professor David Chadd, of the School of Music at the University of East Anglia, and I was hired as the only research associate. Both Professor Chadd and I were responsible for transcription and markup of manuscripts, and I had the additional responsibility of developing the technical aspects, which meant becoming familiar with various associated web technologies. The intent behind the project was not only the creation of accessible editions of the manuscripts in a preservation-friendly format, but also the exploration of the technologies available for such publication.
§ 3 Liturgical manuscripts are often ordered by liturgical day. These in turn contain the various services (Matins, Lauds, Vespers, etc.), which in turn list the antiphons and responds to be sung, and the various readings and prayers. In deciding how to encode this information electronically, we recognised the inherent hierarchical structure. For the most part, service items like antiphons are inside services, which are in turn inside days. This is one of the reasons that it was decided that using XML would be an appropriate method to mark up the aspects of these manuscripts which we wished to record.
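The hierarchy we recognised can be sketched in simplified form; the element names below are purely illustrative, not the project's actual schema:

```xml
<!-- illustrative sketch only: one liturgical day containing a service,
     which in turn contains its service items in running order -->
<day code="07041000">
  <service type="vespers">
    <antiphon>...</antiphon>
    <respond>...</respond>
    <reading>...</reading>
    <prayer>...</prayer>
  </service>
</day>
```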
§ 4 Since liturgical days would be encoded in multiple manuscripts, it seemed sensible to give each day some sort of code or ID number, so that the same day in different manuscripts could be compared. The comparison of liturgical days, and more specifically the running order of the service items (e.g. antiphons, responds, prayers, readings, etc.), is central to the investigation of the relationships between liturgical manuscripts and their traditions. Luckily for us, another medieval liturgy project, the CANTUS Project (http://publish.uwo.ca/~cantus/), had already devised a comprehensive numbering system. By converting their list of liturgical feasts and descriptions to XML, we were able to refer to these ID numbers when processing our own documents. This meant that we could apply a consistent naming strategy for each of the liturgical days. Although the CURSUS Project does not link directly to any of the output of the CANTUS Project, the use of the same numbering scheme leaves open future options for interoperability between the resources of the two projects.
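Converted to XML, an entry in such a feast-code list might look something like the following sketch (the element and attribute names are assumptions for illustration; the code and feast name are taken from an example discussed later in this article):

```xml
<feast code="07041000">
  <name>Dominica 4 Quadragesimae</name>
  <desc>The fourth Sunday of Lent</desc>
</feast>
```

A stylesheet processing a manuscript can then look up any day's description by its code, rather than repeating that description in every manuscript file.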
Corpus Antiphonalium Officii
§ 5 It is not only the liturgical days that have an accepted numbering scheme, but also the service items themselves. There exists an enormous printed collection of antiphons and responds based on 12 manuscripts: the Corpus Antiphonalium Officii (Hesbert, 1963-1979, hereafter CAO). Amongst other things, this work provides an alphabetically-organised set of around 8000 critical editions of antiphons and responds found in these manuscripts. As such, it is a good resource for those studying any of the many manuscripts that Hesbert did not include as it provides a base text from which to identify a newly edited antiphon or respond. The existence of this work meant that many of the antiphons and responds that we came across had already been edited for print publication. More importantly, these have been assigned an ID number that is used internationally to refer to particular items. If you tell a liturgical scholar that you are talking about a variant of antiphon CAO:1208 (http://www.cursus.uea.ac.uk/ed/c1208), they know which one you mean (or at least can look it up).
§ 6 Just as the list of CANTUS codes gives us information about each liturgical day in an external file, CAO provides us a preset list of ID numbers for every antiphon and respond; it makes sense that we should make use of it for our antiphons and responds. There are different ways in which we could use this information. We could have just added this number as an attribute to every antiphon or respond when editing individual versions of the manuscripts. This would enable a simple relationship between the edited items in different manuscripts. However, this would only allow a very limited degree of comparison, and one of our goals was to enable the production of more complex critical editions of individual service items where all the variants were clearly visible. So, we decided to do something a bit more complex.
§ 7 I am not a liturgical scholar—my research interests are in manuscripts, electronic text encoding, and medieval drama—but it became immediately apparent that the textual differences between the same antiphon in two different manuscripts were usually only very minor scribal variations. Furthermore, when there were textual variants, they often tended to group together into identical or related families of variants, perhaps based on manuscript transmission.
§ 8 Since the text of these service items is often so consistent between manuscripts, instead of having separate edited items in each manuscript, it seemed more beneficial to store all the antiphons, responds, and prayers in a single place and mark up the differences between the various manuscript readings as we encountered them. Then, in our editing of the manuscript, we could just point to the ID number of an antiphon and, when displaying the manuscript, use that to pull out the correct reading for the antiphon from that central XML repository.
§ 9 Although this would take more time than producing individual editions, it was decided that this was the better practice, since it not only allowed great ease of comparison among manuscripts, but also enabled us to extend the system by adding new antiphons or manuscripts as we went along. In order to do this we digitised and marked up in XML a copy of the almost 8000 antiphons and responds from CAO. This had the added benefit of allowing comparisons not only between the readings of the manuscripts that the CURSUS Project edited, but also with the manuscripts originally used to compile CAO. In some ways, this created an electronic version of these portions of CAO and then introduced the readings from the CURSUS manuscripts (http://www.cursus.uea.ac.uk/manuscript) into it.
§ 10 This means that for each antiphon, respond, or prayer added to an individual manuscript file, two files need to be edited. In the manuscript file we would add an <xptr> element containing the ID number of the antiphon. The <xptr> element is an extended pointer which points to the location of the relevant content in the CURSUS repository. In the central repository we would locate that antiphon and mark up any differences between the existing record and the one in the manuscript using elements from the Text Encoding Initiative (TEI) Guidelines (Sperberg-McQueen 2002) for creating a critical apparatus.
CURSUS manuscript source for antiphon c1208
§ 11 Figure 1 shows the local form of markup used in the manuscript file, in this case an excerpt from the Hyde Breviary (http://www.cursus.uea.ac.uk/ms/hyde). There are a number of <incipit> elements of various types, and an <antiphon> element containing an <xptr> . Incipits are short pieces of text which were used in the medieval manuscript as abbreviated markers for the full text. Although the text simply says 'Dicamus omnes', any medieval monk worth his salt would know to which hymn this referred. The problem comes when there are multiple hymns which start this way, and this may explain some of the variation in manuscripts when they are expanded. There are a number of XML entities used in this example: '&Y;', '&V;', '&AE;', '&P;', and '&O;'. Each of these is expanded in the associated DTD to represent a very consistent marker used in the manuscripts before each of these service items. So in the case of '&O;' this expands to <rubric type="oratio"> Oratio </rubric> . Almost all prayers are signalled by such a rubric in Latin, though it varies in how it is abbreviated with 'O.', 'Or.' and 'Orat.' being particularly common. We made an editorial decision to standardise these and use this entity, except where the use was particularly unusual. In those cases the rubric would be marked individually. The <antiphon> element contains an <xptr> whose 'from' attribute records the CAO ID number for the text of that antiphon. All the other elements represent a few standard liturgical items, though a full service would have many more of these.
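Based on that description, the shape of the markup in Figure 1 is roughly as follows; this is a simplified sketch rather than the actual transcription:

```xml
<!-- simplified sketch of the manuscript markup described above;
     '&O;' expands in the DTD to <rubric type="oratio">Oratio</rubric> -->
<incipit type="hymn">Dicamus omnes</incipit>
&O;
<antiphon>
  <xptr from="c1208"/>
</antiphon>
```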
CURSUS repository source for antiphon c1208
§ 12 Figure 2 below shows the version of this same antiphon as it appears in the CURSUS repository:
§ 13 This antiphon, which we have marked with an <ant> element to differentiate it from the <antiphon> elements encoded in the manuscript files, contains an 'id' attribute with the same CAO ID number, 'c1208'. In addition, as a result of the XSLT stylesheet which has processed this file, the CAO ID numbers of the next and previous items have been automatically added. This is not strictly necessary, since they could be retrieved at any step in the processing, but having them stored here leads to more efficient processing of the individual items in their critical edition view. Moreover, retaining these makes it more straightforward to provide functionality such as 'Previous Antiphon' and 'Next Antiphon' links without needing to re-discover them while dynamically processing this file. Since these values were automatically generated by the XSLT stylesheet which pulled the individual service items out of the repository, they cost little time or effort to produce, yet they make the processing of these files by a number of other stylesheets much simpler. Sometimes building in redundancy where it is easiest can simplify later processing.
§ 14 The <ant> element also contains a <header> child element. This is used to store the usage of this antiphon in each of the manuscripts the CURSUS Project edited. Each <usage> element contains a manuscript sigil and the CANTUS feast code which indicates a particular liturgical day (and is recorded in the CURSUS editions of the manuscripts). In this case all of the manuscripts agree and use this antiphon on day '07041000', which in the file of CANTUS feast codes is listed as 'Dominica 4 Quadragesimae', the fourth Sunday of the season of Quadragesima or Lent. That all the CURSUS manuscripts (http://www.cursus.uea.ac.uk/manuscripts) that reference this antiphon use it on the same liturgical day is not at all surprising; this is quite common. What is more interesting to many liturgical scholars are those instances where one or a few manuscripts use it on a different day.
§ 15 The text of the antiphon follows in an <aBody> element (responds use an <rBody> but have a slightly more complicated structure), which has a 'wit' attribute containing a white-space separated list of manuscript sigla whose critical readings are found in this antiphon. It should be noted that while the <header> contained <usage> elements for 8 manuscripts edited by the CURSUS Project, the 'wit' attribute on the <aBody> records 10 manuscripts. This is because the readings of two CAO manuscripts, CAO-F (The Antiphoner of Saint-Maur-les-Fosses) and CAO-G (The Durham Antiphoner), are also edited here.
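Putting these pieces together, a repository entry has roughly the following shape (a simplified sketch; the sigil shown and the 'prev'/'next' values are invented for illustration):

```xml
<ant id="c1208" prev="..." next="...">
  <header>
    <!-- one <usage> per CURSUS manuscript that uses this antiphon -->
    <usage ms="HYD" code="07041000"/>
  </header>
  <aBody wit="HYD CAO-F CAO-G">
    <!-- the antiphon text, with apparatus markup for variant readings -->
  </aBody>
</ant>
```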
CURSUS DTD and critical apparatus markup
§ 16 The composition of the structural elements for these items is important, and thus their content has been constrained by the CURSUS DTD. For example, <usage> is one of the few elements allowed inside this <header> and it is required to contain both an 'ms' and a 'code' attribute. Moreover, the value of the 'ms' attribute is required to be the sigil or ID of one of the manuscripts listed in the document's header. If such conditions aren't met, the document will fail validation when passed through a validating XML parser.
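A minimal DTD sketch (not the actual CURSUS DTD) shows how such constraints can be expressed; declaring 'ms' as an IDREF makes a validating parser check that its value matches an ID declared elsewhere in the document, such as on a manuscript listed in the header:

```xml
<!ELEMENT header (usage+)>
<!ELEMENT usage EMPTY>
<!ATTLIST usage
  ms    IDREF  #REQUIRED
  code  CDATA  #REQUIRED>
```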
§ 17 The differing readings for each particular manuscript are described using fairly standard TEI critical apparatus tags. Since there is no individual base text or agreed reading, no lemma is used. There is an <app> element that contains multiple <rdg> elements each with a 'wit' attribute listing the manuscript sigla for that reading. This allows any individual reading to be pulled out, but also enables comparisons among the readings of all of the manuscripts.
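As a sketch, a variant phrase within an antiphon might therefore be encoded like this (the sigla and readings are invented for illustration):

```xml
<app>
  <rdg wit="HYD CAO-F">in conspectu angelorum</rdg>
  <rdg wit="CAO-G">in conspectu angelorum psallam tibi</rdg>
</app>
```

Any single manuscript's text can then be reconstructed by selecting, within each <app>, the <rdg> whose 'wit' attribute contains that manuscript's sigil.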
§ 18 One of the problems with this form of critical apparatus markup arises from the incremental nature of the additions to the repository (see Bart 2006). If all the possible witness readings were available at the same time, then convenient divisions for the different readings could be established. However, because the manuscripts were edited one after another, the amount of text that the reading elements cover often had to be adjusted when a new manuscript's variants extended over the divisions of the original reading. This could have led to a problem of the different variations overlapping, and although this was very unusual, the simple solution was to separate out the problematic reading from the others. In these few cases, the very minor loss of detail in comparison was justified by the ease of implementation. The growing and shrinking of these elements could have been complex. However, the project produced a number of macros for the free and open source editing environment it was using which simplified this task. We created similar shortcuts for most of the repetitive editing tasks.
Modular organisation of CURSUS data
§ 19 Since our editions of the manuscripts pull their antiphons, responds, and prayers from a separate repository file, it also seemed logical to do the same with biblical readings. It was not the point of the project to provide an edition of critical readings of the Bible. For this reason, and to speed up the work undertaken by the project, we did not mark up any of the scribal variations for biblical readings. Instead we simply recorded the book, chapter, and verse numbers of the reading used. This meant that we could import the text we needed from a single standardised edition. In order to enable this functionality we needed a source text, so I transformed an edition of the Latin Vulgate Bible that I was provided with into XML (http://www.cursus.uea.ac.uk/vulgate).
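A biblical reading in a manuscript file could therefore be reduced to a pointer into the Vulgate file; a sketch, with illustrative attribute values:

```xml
<!-- illustrative sketch: the reading itself is pulled in at display
     time from the project's XML Vulgate -->
<xptr doc="vulgate" from="Ioh.9.1" to="Ioh.9.7"/>
```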
§ 20 The result of this is that the bulk of the text in the editions of the manuscripts, when processed for display, is brought in from other locations. Every single full antiphon, respond, prayer or biblical reading is not present in the original, but only pointed to by a TEI <xptr> element. In these very skeletal files, it is only textual elements that are unique to individual manuscripts that are stored in the file itself (for a conceptually similar approach, see O'Donnell 2005).
§ 21 A side effect of this modular organisation is that it makes it easy to see which other manuscripts contain the particular antiphon or respond a liturgical scholar happens to be interested in. For example, the critical edition view of a particular antiphon includes not only all the variant readings, but also a list of links showing which CURSUS-edited manuscripts use that antiphon, and on which feast days. Clicking on one of these links takes you to that liturgical day in the edited manuscript.
§ 22 In the edition of a manuscript, each individual reading of an antiphon, respond, or prayer is accompanied by a link to the critical edition of that item. Similarly, any biblical reading links to an edition of that book of the Bible. Incipits, abbreviated items that might be impossible to decipher with complete confidence, link to an index of similar service items starting with the same letter.
§ 23 This organisation means that the site navigation is very circular, jumping from manuscript, to critical editions of antiphons, to individual manuscript readings, to other manuscripts, to the Latin Vulgate Bible. This freedom to explore, and the recursive reading it allows, frees users to work in ways which we might not have predicted.
§ 24 We chose to use this kind of modular framework because we found it suited the research enquiries that liturgical scholars might want to make of the data we were assembling, while leaving them free to explore it in ways we might not predict. But all of this might be for nought if we created a static website that did not allow a user to take advantage of this modularity. This left us with a problem of final presentation: how might we deliver this data in the most flexible and usable method?
Web publishing framework
§ 25 As with most small academic projects, the funded period of the project had a fixed time-scale and limited funding. This made freely available open source technologies very attractive. As it turned out, some of the products that best suited our needs happened to be open source.
§ 26 The University did not have a centralised server dedicated to web projects that was capable of providing the dynamic manipulation and interrogation of texts required by the CURSUS Project. We therefore set up one of the project's desktop machines as a server running Debian GNU/Linux, a powerful, stable, free and open source operating system.
§ 27 We wanted virtual URLs, allowing our users to request dynamically-assembled pages built from a variety of XML source files. For this reason we selected Apache's Cocoon, also free and open source, as our XML publishing framework. Cocoon allows for the true separation of content, logic, and presentation. Depending on user input, an XML manuscript can be processed and displayed in many different ways. Output can be serialised not only as HTML, but also as PDF and RTF, amongst many other options. In our case the programming logic for these transformations was written entirely in XSLT, but it could have been done equally well in Java or many other programming languages.
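In Cocoon this separation is configured in a sitemap, which matches virtual URLs and pipes an XML source through XSLT to a serialiser. A minimal sketch (the URL pattern and file paths here are illustrative, not the project's actual sitemap):

```xml
<map:match pattern="ms/*">
  <!-- generate: read the requested manuscript's XML source -->
  <map:generate src="xml/manuscripts/{1}.xml"/>
  <!-- transform: apply the display logic, written in XSLT -->
  <map:transform src="xslt/manuscript-to-html.xsl"/>
  <!-- serialize: deliver the result as HTML -->
  <map:serialize type="html"/>
</map:match>
```

Swapping the final serialiser for a PDF or RTF one changes the output format without touching the content or the logic.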
§ 28 Our next problem was how to search the data. Cocoon, of course, has various search options bundled with it, but in our desire to be cutting edge we also wanted to explore the provision of more complex searches based on the document structure. For this we chose one of the best available native XML databases, eXist, which is also free, open source, and has a wonderfully supportive community. Using eXist allowed us to preformulate XQuery or XPath searches from user input on web forms, including SQL-like complex joins between documents. eXist could also be easily embedded into our existing Cocoon installation, as eXist uses Cocoon itself in its standalone implementations.
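As a sketch, a search formulated from a web form might become an XQuery along these lines (the collection path is illustrative; the element names follow the repository markup described earlier):

```xquery
for $a in collection('/db/cursus')//ant
where $a/header/usage[@ms = 'HYD' and @code = '07041000']
return
  <hit id="{$a/@id}">{ $a/aBody }</hit>
```

This finds every antiphon that a given manuscript uses on a given liturgical day, a structure-aware query that a purely full-text search engine could not easily express.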
§ 29 The point of describing the software we chose, and our reasons for selecting it, is to highlight that these technologies only cost the time it took to become acquainted with them. Moreover, we were not tied in to a closed proprietary solution which might limit our flexibility, but instead built applications on top of open standards. Given the limited funding of any academic project, the coupling of freedom and lack of direct cost is a crucial benefit.
CURSUS markup and the Text Encoding Initiative
§ 30 The nature of the markup used by the CURSUS Project deserves some explanation. I'm currently serving an elected term on the TEI Technical Council. As such, I would naturally suggest that it is generally a good idea to follow the TEI Guidelines (Sperberg-McQueen 2002). Using standardised guidelines for markup of texts allows a greater degree of interoperability between resources.
§ 31 But, unsurprisingly, the TEI does not have specialised elements for denoting antiphons, responds, and prayers. So how can I claim that the CURSUS Project complies with the TEI Guidelines? One of the most important benefits of the TEI is that you can extend it where necessary to suit your own project's purposes. These antiphons, responds, and prayers are just paragraph-level divisions that could have been encoded using existing TEI structures (namely the <p> element). A 'syntactic sugar' modification enabled us to create a more convenient taxonomy distinguishing between these three types of paragraph-level division. This means that when marking up an antiphon, which in "regular" TEI would be <p type="antiphon">, we were able to mark it simply as <antiphon>. In addition, we added new elements by following the TEI P4 Guidelines for Modifying and Customizing the TEI DTD (http://www.tei-c.org/P4X/MD.html). The most recent version of the TEI makes it even easier for projects to localise their use of the TEI (http://www.tei-c.org/P5/).
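As a sketch, declaring such a project-specific element in a P4 DTD extension file might look like the following (the content model is simplified and illustrative, not the actual CURSUS declaration):

```xml
<!-- illustrative only: a local element declaration added via the
     TEI P4 extension mechanism -->
<!ELEMENT antiphon (#PCDATA | xptr | rubric)*>
<!ATTLIST antiphon id ID #IMPLIED>
```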
The Records of Early English Drama
§ 32 For the CURSUS Project, we customised the TEI to provide markup that reflected the nature of the documents themselves. We served up these documents from within a flexible and dynamic XML publishing framework that enables users to search the manuscripts, repository of antiphons, responds and prayers, as well as the entire copy of the Latin Vulgate Bible. And we did this without any funding for huge expensive servers or proprietary software solutions.
§ 33 What about work that has no funding whatsoever? In my personal research I've been working with some volumes of the Records of Early English Drama Project (http://www.reed.utoronto.ca/). Over the last few decades, this project has been publishing volumes of extracts from medieval account books that mention drama and related activities. As the executive editor had the extreme foresight to retain for the project the electronic rights to the material when arranging for publication with the University of Toronto Press, there is keen interest within the project in converting the material to a usable electronic format. The REED Project, and the community that surrounds it, have always been drawn to the benefits of exploring extracted records with computers, and of using them to create research resources in their own right. This is the case with past projects such as the York Doomsday Project (King and Twycross 1995) and the REED Project's Patrons and Performances Website (http://link.library.utoronto.ca/reed/).
REED print format
§ 34 Instead of using the REED material (which remains their copyright property), my examples here are taken from my doctoral dissertation (Cummings 2001), in which I transcribed records from a geographical area that the REED Project has not yet worked on, the LeStrange Household Accounts of Hunstanton (LeStrange Household Accounts, 1533-9). Nevertheless, I took the same steps with very similar data. Figure 4 shows some extracts in a format similar to the REED print publications:
REED typesetting codes
§ 35 Many of the printed volumes of REED records were created from a process which involved the files at one point being stored in an ASCII file of bespoke typesetting codes. Originally designed and implemented by Willard McCarty, this system was well designed for its intended purpose given the technology available at the time (McCarty 1984; Nelson 1983). In converting REED volumes to structurally encoded XML, it makes sense to start with these files where they are available. Unfortunately, we don't always have these original text files. Figure 5 shows what this might look like.
§ 36 Instead, if these files are in electronic form at all, usually they are in even worse formats, like early Quark typesetting files. As a worst case scenario, in my investigation of the possible routes of conversion of this legacy data, I decided to start with these Quark typesetting files. I thought it best to preserve as much of the original formatting as possible, because the REED volumes themselves use a significant number of font and style changes to indicate important palaeographical and textual aspects such as manuscript abbreviations that have been expanded. Thus, any loss of the typesetting of this information would be a loss of intellectual content. The main problem with such a conversion route for legacy material is that in most cases the data has been formatted for presentation rather than structure. Since we are dealing with the lowest semantic level possible—that of basic font and spacing changes—discovering even the basic structural units such as words can prove difficult. To up-convert such legacy data to a file which has a greater degree of structural markup is often quite problematic.
Converting REED legacy data
§ 37 After testing a range of existing solutions, I decided that the best transformation was to export the Quark files as PDF files, and then convert these to other formats. Recent versions of Adobe Acrobat allow one to save a PDF file as a type of XML; when tested, however, this always yielded unsatisfactory results. In the end, the method I chose was to export the file from Adobe Acrobat as an RTF file which could then be read by OpenOffice. I chose this route because the numerous other PDF-to-RTF converters available each had various problems converting the file. One commercial product even shifted the italics, used in REED to indicate expanded abbreviations, one character to the left. I had access to a computer with a full version of Adobe Acrobat installed, so this tool was easily available.
§ 38 Once the files were in RTF, OpenOffice made a number of possible methods available to me. Since OpenOffice's default format is XML-based, I could have developed an XSLT-based solution to transform the presentational markup to structural markup. However, my attempts in this area showed that the REED formatting decisions meant that this would necessitate a more involved process, and that significant manual correction would always be necessary. As this was the case, I settled on a straightforward but manual method of global replace based on formatting where possible. This resulted in annotation of the text with ad hoc codes to indicate the beginning and end of those typographic conventions used in the REED volumes. For example, REED used square brackets ([ ]) to indicate a cancellation in the original, and a pipe (|) to indicate a change of folio in a passage of continuous prose. These special characters, along with most of the rest of the editorial apparatus, could be preserved through this method. The file was then saved as HTML and these codes converted to well-formed XML elements. Other options would have been to use the available "OpenOffice to TEI XML" filters to save the document as TEI XML, or to save the document as plain text. However, at the time these lost some of the formatting which transferred over to HTML well and then could be replaced using find and replace in an XML or text editor.
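As an illustration of the end of this process, a single entry might look like the following once the ad hoc codes have been turned into elements (the element names and the reading itself are invented for this example):

```xml
<!-- illustrative sketch: italics became <ex> (expanded abbreviation),
     square brackets became <del> (cancellation), and the pipe
     became <pb/> (change of folio) -->
It<ex>e</ex>m paid to the m<ex>in</ex>strells
<del>of the towne</del> for ther play <pb/>
```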
§ 39 The process of using an incremental ad hoc intermediate format is quite common in such conversions. In many cases it is difficult to know very much about the structure of the document until much of the conversion has been done, so an incremental approach allows a greater ease of recovery from mistaken conversion steps. However, it highlights a basic problem in such attempts to manipulate legacy files: until they are in some format for which tools exist, it is difficult to examine or validate their structure. And yet, one of the most pressing reasons for taking this circuitous route in conversion was that this research was not funded—I used only those resources that I already had available to me. These included OpenOffice, a free suite of office applications that includes a word processor.
REED Ad Hoc conversion DTD
§ 40 Initially I did not adopt a known or standardised schema for my XML. I decided that, although existing standards like TEI XML would have been sufficient, I would attempt to define my own schema reflecting the structure of the REED extracts as printed in the volumes. So I created a local encoding schema which made sense based on my knowledge of the print volumes. I took the resulting local XML and imported it into an XML database. This allowed me to query the data with XQuery to test whether the data model I had chosen let me easily extract the required results. Figure 6 contains an example of this local XML format with additional information not in the original:
§ 41 There were a number of intermediate steps between these formats, where the successive additional layers of markup were applied. For example, the classification of an item as 'payment to minstrels' or the description of an ellipsis in the transcription as 'Three food purchases omitted'. This kind of information was not present in the original and was only added where it was known. In other cases, such as the provision of GIS mapping references, the data was mocked-up for this example to indicate the possibilities of what sort of data might be added in later passes through the document. In this sense the data is a hypothetical example of what could possibly be added given sufficient time and funding.
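A record in a local format of this kind might look roughly like the following sketch (this is not the actual markup of Figure 6; the element names and values are invented, and the GIS reference is mocked up just as in the figure itself):

```xml
<record folio="12v">
  <class>payment to minstrels</class>
  <entry>It<ex>e</ex>m paid to the m<ex>in</ex>strells ...</entry>
  <gap desc="Three food purchases omitted"/>
  <place name="Hunstanton" gisRef="TF6740"/>
</record>
```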
REED conversion and the Text Encoding Initiative
§ 42 Why did I create my own local format instead of just using an internationally recognised standard like TEI? Since the REED volumes are filled with extracts from manuscripts, my choice of markup mimicked the enclosing structure of these extracts. Although I did create a local encoding format, this did not mean I completely ignored the lessons of well thought-out standards. In many cases, I have borrowed elements from the TEI Guidelines, or renamed them to more clearly reflect the nature of the data being encoded, but preserved the underlying data model. The TEI's methods of encoding have been debated back and forth, and fleshed out to be applicable to almost any type of document. They have some very good ideas for encoding of textual information and there is no reason why one can’t benefit from their experience.
§ 43 In the end, once the data model proved robust enough for my purposes, I did indeed convert the data to TEI P4. Local encoding formats are extremely useful and exploit one of the basic and most powerful aspects of XML. One can make explicit the document structure in terms which are used by those who study that form of document. This is why the CURSUS Project extended the TEI to include antiphons and responds, and why in converting REED records I initially created my own DTD. However, once the documents are encoded there is no reason not to convert to a more recognised format and to reap the benefits of doing so.
§ 44 As mentioned earlier, in the most recent version of the TEI Guidelines, TEI P5 (http://www.tei-c.org/P5/), the use of local encoding variants will become even easier. Not only will it be possible to include elements from other XML namespaces, but the addition of new elements and attributes, and the renaming of existing ones, will become increasingly straightforward. A command-line and web interface to this, Roma (http://www.tei-c.org/Roma/), allows easy customisation of TEI modules and the generation of schemas in a variety of languages, including Relax NG and W3C XML Schema (this is similar to what the Pizza Chef used to do for TEI P4). Moreover, the TEI's efforts to provide a greater degree of internationalisation mean that the schema elements and their accompanying descriptions can be provided in a local language, making text encoding easier for projects worldwide. One of the benefits of this internationalisation is that those hired for data entry need not know English to work for a project, and can use elements with names they recognise; the resulting XML can still easily be transformed back to standard 'International' TEI XML for interchange where needed.
§ 45 Instead of extending the TEI schema as the CURSUS Project did, what I did with the REED data was to pilfer bits and pieces of the TEI's data model and the manner in which the TEI does things. Only when there was no more encoding to be done—for my limited purposes—was the data converted to a standard suitable for use with existing tools. When the conversions are straightforward XSLT transformations, the use of local encoding formats can be a very constructive method for manipulating legacy data into more accessible forms.
The Oxford Text Archive
§ 46 I currently work for the Oxford Text Archive (OTA) (http://www.ota.ox.ac.uk/). The OTA has been collecting electronic texts created by academics since 1976, and as such we have a wide variety of texts in numerous different formats. More recently we have come to host the UK's Arts and Humanities Data Service (AHDS) subject centre for literature, languages and linguistics (http://www.ahds.ac.uk/litlangling/). The OTA provides free long-term preservation for textual resources of a primary academic nature. Since our licence is non-exclusive, we are happy to act as a preservation repository for data that is also preserved elsewhere. We also provide advice to those intending to create electronic resources and have published a number of guides to good practice (Morrison et al. 2000; Wynne 2005).
OTA use of the TEI header
§ 47 In the existing workflow of the archive, all of the metadata for each resource is stored in a separate document with little content other than a <teiHeader> element. This is another example of using the TEI in a way it wasn’t originally conceived to be used. It is not really a use of the so-called TEI Independent Header, because each of these documents forms a valid TEI document in itself, simply without any real content in the body. The TEI Header stores all the metadata concerning the electronic document, including sections detailing the file's cataloguing details, availability, format, and a list of changes. We separate the body of the document from the header itself because the documents are in many different formats; keeping the header as a separate file also means that the content of the text file can be dynamically added into the body at the time of delivery.
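In outline, such a header-only document looks something like the following. This is a simplified, invented illustration rather than an actual OTA catalogue record; the empty body is where content in whatever format is merged in at delivery time:

```xml
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title of the deposited resource</title>
      </titleStmt>
      <publicationStmt>
        <p>Availability and licensing details</p>
      </publicationStmt>
      <sourceDesc>
        <p>Details of the source from which the electronic text derives</p>
      </sourceDesc>
    </fileDesc>
    <revisionDesc>
      <change>
        <date>1985</date>
        <respStmt><name>Depositor</name><resp>creator</resp></respStmt>
        <item>Deposited with the archive</item>
      </change>
    </revisionDesc>
  </teiHeader>
  <text>
    <!-- body left effectively empty; the resource itself is added dynamically -->
    <body><p/></body>
  </text>
</TEI.2>
```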
OTA legacy data migration pilot project
§ 48 The OTA undertook a pilot study to examine the formats of the archive and to consider the problems and possibilities of migrating this legacy data to more modern formats. The pilot study was undertaken by Monica Langerth Zetterman of Uppsala University and myself, with the aim of understanding what formats we had and how they differed, what our priorities in conversion should be, and identifying possible routes for migration and enhancement of metadata. For this pilot study we limited ourselves to a subset of texts. We decided that a random sample from across the archive was not a good idea; instead we wanted to choose a particular subject area and see the range of formats within it. This needed to be a type of text with which I was familiar, and one with a regularly defined structure, so that the different forms of textual markup would be easy to distinguish. Moreover, it needed to cover enough different formats to be generally representative of the archive as a whole. In choosing the subset, since it was up to me, I decided that we should concentrate on early English drama. This made it easy to understand the text structure, which generally has a format of lines of verse, within speeches, within scenes, within acts, with occasional other features like stage directions. Moreover, because it was a genre of text with which I was very familiar, it made the process of metadata enhancement that much easier. The selection of texts was based on data-mining of the Library of Congress Subject Headings, which the archive stores within the TEI header for every deposit.
§ 49 In the end this sample provided a good range of the representative formats available in the archive. Moreover, the pilot study provided much useful information concerning the length of time it took to identify text formats, the cases (upper/lower/mixed) in which the texts were available, the markup formats the archive contained, the processes involved in metadata enhancement, and some possible routes for conversion of certain formats of texts. It also reclassified a substantial number of texts in the pilot sample whose format had previously been described only as 'Unknown Markup'. COCOA is one of the archive's most common legacy data formats, and for this reason the OTA developed a conversion methodology for verse drama encoded in COCOA.
Legacy data example: COCOA format
§ 50 The TEI Header associated with this non-TEI file indicates it was deposited in 1985 by David Gunby, the creator of the electronic version. As can be seen, this format uses a form of non-nesting markup. The start of a particular element (scene, speech, etc.) is marked, and any content following it is assumed to belong to that element until the next occurrence of the same type of element. Thus a speech, 'Q' in this markup, runs until the next 'Q' starts. This means that no nesting is explicitly marked, and it has to be deduced at the time of processing. A similar state-variable approach has been adopted, more or less, in Microsoft's XML for Word documents, so the need to deduce structure during conversion is a problem that may again become more common. There was very little convention in the naming of COCOA elements: although here 'SN' is used for 'scene', 'SSD' for 'short stage direction' and 'Q' for 'quote or speech', the use of these particular elements was not entirely standardised in COCOA markup. The user manual for the Oxford Concordance Program contains a basic introduction to COCOA and its later evolutions for use with the program (Hockey and Martin 1988).
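The deposit itself is not reproduced here, but a hypothetical fragment in this style (invented for illustration, using the 'SN', 'SSD' and 'Q' names mentioned above) shows the convention of implied, non-nesting extents:

```
<SN 1>
<SSD ENTER TWO SPEAKERS>
<Q FIRST SPEAKER>
FIRST LINE OF THE SPEECH
SECOND LINE OF THE SPEECH
<Q SECOND SPEAKER>
THE REPLY, WHICH IMPLICITLY ENDS THE FIRST SPEECH
```

Note that the first speech is never explicitly closed; it simply runs until the second 'Q' tag begins, and the scene runs until the next 'SN'.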
§ 51 While the OTA's status as an archive means we will always keep a copy of the original deposit file, we would eventually like to migrate as many of the unconverted texts as possible to XML, to allow them to be of more use to those downloading the files. As with most technical problems, this is a question of time and money. While many new deposits are converted to more acceptable preservation formats, only a small portion of the older deposits have been migrated. In any conversion we attempt to ensure that all the intellectual content of the original is preserved and, where possible, are cautious about introducing new interpretation through the markup.
Conversion: COCOA to COCOA-ML
§ 52 For verse drama in COCOA I've developed a mostly XSLT-based method to change texts such as this into fully structured TEI documents. This begins by using a simple Perl regular expression to modify each COCOA tag to become a well-formed empty XML element. The Perl script used to do this is shown as Figure 8 below, but is also freely available (http://purl.org/cummings/research/cocoa2tei/COCOA2FlatCOCOA-ML.pl).
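The Perl script linked above is the authoritative version; purely for illustration, an equivalent step can be sketched in Python. The attribute name n and the simplified view of COCOA tag syntax used here are assumptions, not the script's actual behaviour:

```python
import re

def cocoa_to_cocoa_ml(text):
    """Rewrite each COCOA tag (e.g. <Q SPEAKER>) as a well-formed
    empty XML element (e.g. <Q n="SPEAKER"/>).

    The attribute name 'n' is an illustrative assumption."""
    def rewrite(match):
        name, value = match.group(1), match.group(2).strip()
        if value:
            return '<{0} n="{1}"/>'.format(name, value)
        return '<{0}/>'.format(name)
    # A COCOA tag is taken to be a name followed by an optional value,
    # enclosed in angle brackets.
    return re.sub(r'<([A-Za-z]+)\s*([^>]*)>', rewrite, text)
```

The lines of spoken text between the tags are left untouched at this stage; structure is only added in the later steps.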
Conversion: COCOA-ML to flat TEI XML
§ 53 The same task could be accomplished equally well in XSLT2, and I have considered substituting an XSLT2 processing step at this point to simplify the tool-chain requirements. The next step was to modify these COCOA-ML elements to be flat, empty versions of the corresponding TEI elements for representing the same structures. The majority of this XSLT2 stylesheet is a simple non-structural conversion which renames the elements; it is these templates which would have to change to adapt to new COCOA elements for other types of texts. However, one part of it gives new structure: it is at this point that lines of text, unmarked in the original, are given structure by tokenising based on the lineation of the text. They are marked as lines, or TEI <l> elements, in the resulting output. The existence of the tokenize() function in XSLT2 means that outside parser extensions are not needed for this conversion. A portion of the stylesheet to do this is shown below as Figure 9, but it is also freely available in full (http://purl.org/cummings/research/cocoa2tei/FlatCOCOA-ML2FlatTEI-XML.xsl).
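The effect of this renaming and tokenising step can be illustrated in Python. The element mapping below is an assumption made for illustration only, not the stylesheet's actual renaming table:

```python
import re

# Illustrative mapping only: the real stylesheet covers more
# elements and may map them differently.
RENAME = {'SN': 'div', 'SSD': 'stage', 'Q': 'sp'}

def flat_cocoa_ml_to_flat_tei(text):
    """Rename flat COCOA-ML elements to flat TEI equivalents and
    wrap each untagged line of text in a TEI <l> element."""
    def rename(match):
        name, attrs = match.group(1), match.group(2)
        return '<{0}{1}/>'.format(RENAME.get(name, name.lower()), attrs)
    # Non-structural renaming of the empty elements.
    text = re.sub(r'<([A-Za-z]+)([^<>/]*)/>', rename, text)
    # Tokenising on lineation: each line of verse becomes an <l>.
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith('<'):
            lines.append('<l>{0}</l>'.format(stripped))
        elif stripped:
            lines.append(stripped)
    return '\n'.join(lines)
```

As in the stylesheet itself, the output is still flat: the <l>, <sp> and <div> elements sit side by side and are only nested in the final step.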
Conversion: Filling out flat TEI XML
§ 54 A final step in this conversion process expands these empty, milestone-like elements into fully enclosing elements, while making sure to take account of elements (like <stage>) which can appear inside the speeches and lines. This takes advantage of the XSLT2 grouping mechanisms and uses nested <xsl:for-each-group> to expand the structure of the document (Kay 2004). As verse drama has a fairly simple hierarchical structure, this is straightforward, and one could add more levels of hierarchy simply by nesting more grouping in the XSLT stylesheet. Part of this stylesheet is shown below as Figure 10, but it is also freely available in full (http://purl.org/cummings/research/cocoa2tei/FillFlatTEI-XML.xsl).
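The effect of the nested grouping can be sketched in Python with ElementTree as a rough analogue of the XSLT2 approach; the two-level hierarchy of <sp> within <div> is assumed here for simplicity:

```python
import xml.etree.ElementTree as ET

def fill_flat(flat, levels=('div', 'sp')):
    """Expand milestone-like empty elements into enclosing ones,
    one hierarchy level at a time, mirroring nested
    <xsl:for-each-group group-starting-with="..."> in the stylesheet."""
    if not levels:
        return flat
    marker, rest = levels[0], levels[1:]
    grouped = ET.Element(flat.tag, flat.attrib)
    current = None
    for child in list(flat):
        if child.tag == marker:
            # A milestone starts a new group and becomes a container.
            current = ET.SubElement(grouped, marker, child.attrib)
        elif current is not None:
            current.append(child)   # content inside the current group
        else:
            grouped.append(child)   # content before the first marker
    # Recurse so the next marker level is nested inside each container.
    for container in grouped.findall(marker):
        container[:] = list(fill_flat(container, rest))
    return grouped
```

Elements that are not markers at the current level, such as a <stage> between speeches, are simply carried into whichever group they fall within, which is how the stylesheet keeps stage directions inside the surrounding structure. Adding a further level of hierarchy (acts above scenes, say) would just mean another entry in the levels tuple, as another nested grouping would in the XSLT.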
Conversion: Inherent problems
§ 55 The steps in this COCOA-to-TEI transformation are intentionally modular so that they can be reused as steps in other conversions. The text resulting from this conversion pipeline still has the problems inherent in the original. For example, this text is only in upper case. While these days it might seem ridiculous to create a text in a single case, much legacy data exists in such a format. In many cases the only intended use of the electronic version of the text was linguistic research, where the production of word-frequency tables, concordances and collocation lists could be accomplished without a need for mixed case. It is not only the upper case which remains problematic: in this conversion any character-based flagging has been left untouched. You can see an example in the second last line of the text extract: 'HOW^ DOGGE'. In this case the circumflex is almost certainly used to replace an exclamation mark, but similar characters have been used as ad hoc codes by researchers to indicate the rendition of the original (e.g. italics), linguistic parts of speech, or other unknown aspects they might have been interested in studying. Unless documentation survives to indicate what such flagging means, it is safer, and certainly quicker, to pass it through the conversion unchanged. Another option would have been to mark such unusual characters with an element indicating the place and nature of the flagging, without assuming the interpretative semantics of the original encoder of the document.
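The latter option could be sketched as follows; the <flag> element and its attribute are purely hypothetical placeholders invented for this illustration, not TEI elements:

```python
import re

def mark_flags(text, flags='^'):
    """Wrap each ad hoc flagging character in a placeholder element,
    recording its position without guessing its meaning.

    The <flag> element used here is hypothetical, not TEI."""
    pattern = '[' + re.escape(flags) + ']'
    return re.sub(pattern,
                  lambda m: '<flag char="{0}"/>'.format(m.group(0)),
                  text)
```

Under this approach 'HOW^ DOGGE' would become 'HOW<flag char="^"/> DOGGE', leaving any interpretation of the flag to a later editor who may have access to the depositor's documentation.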
§ 56 There are more significant, but equally obvious, theoretical anxieties concerning the relationship of a converted text to its original, and how such a significant change of format modifies what we consider to be the nature of the text. In addition, there are questions as to whether older resources should be converted at all if alternative versions already exist in a 'better' format. These raise the thorny issue of comparing different versions of the same resource, and of whether one could be considered 'better' simply because it is in XML. This is not clear cut: for example, it may be that one is in mixed case and another in a single case, or that one is a more rigorous academic edition than the other. Any such decision takes not only time but academic judgement. The result of the conversion of this text is shown below as Figure 11:
OTA: Encouraging best practice
§ 57 The OTA offers free archiving and distribution of electronic resources, so that they do not simply vanish as servers die and academics retire. In addition, we assess the technical appendices of funding applications made to the Arts and Humanities Research Council (AHRC) (http://www.ahrc.ac.uk/) by UK scholars wishing to create digital resources. Since we also, quite often, advise these same scholars before they fill out their applications, we intentionally create a virtuous circle of best practice. It goes without saying that applications are assessed by the funding council primarily on their academic merit, but this helps to ensure that those digital projects which they choose to fund have no significant barriers to their success. We encourage, wherever possible, the use of internationally recognised open standards.
§ 58 One of the other benefits of the current arrangement in the UK is that any scholar receiving funding from the AHRC to produce a significant electronic resource must deposit a copy with the AHDS for long-term preservation.
§ 59 One common thread through all of this is the undeniable influence of the TEI on each of these projects. The CURSUS project extended the TEI to make the element names relevant to its encoding needs. My personal research into REED conversion pilfered bits and pieces of the TEI for its local encoding schema, which made the eventual conversion to TEI easier. The Oxford Text Archive stores all its metadata in TEI. The first two of these use the TEI to create electronic resources; the last uses it both to preserve and to disseminate information about the resource. The point is not that it is unusual that these three projects use, change, and extend the TEI—this is not only suggested but encouraged by the TEI Guidelines—but that these customisations can still be interoperable. The modification of the TEI to suit one's own project where necessary should be considered good practice if done in a documented manner and according to the TEI Guidelines.
§ 60 Another common point is the use of open source software and open standards. The CURSUS Project used open source software because of its limited funding. My own research had no funding and so used whatever software was available, most of which was open source. The Oxford Text Archive often advises funding applicants about open source technologies which may provide the same functionality as proprietary products, and especially encourages the use of open standards in proposed projects wherever possible.
§ 61 These projects have been ongoing, and their histories since this article was initially written offer some valuable lessons. In the case of the CURSUS Project, after the end of the funding and my departure I helped to maintain it. Eventually it was moved to a server shared by a number of digital projects in the School of Music. This meant that it became increasingly difficult to repair problems which might require system administrator access, and so (at the time of writing) the project's search facility is no longer working. Hopefully at some point a more recent version of eXist can be installed and the searching reinstated. The CURSUS Project's dynamic and circular system requires an extremely flexible framework, which no commercial product within the limited means of the project could provide. However, the main problem is that the complicated technological solutions to the requirements of the CURSUS Project meant that I, a contract research associate there for a limited time, was the only person who understood them well enough to maintain them properly. Since my time is now taken up with other duties, I have not found the opportunity to re-immerse myself in the workings of the site and politically negotiate the necessary access privileges. Wherever possible, projects should ensure that permanent members of staff have sufficient technical knowledge of, and familiarity with, the project's publication framework to perform the majority of maintenance that may be required. Failing that, budgeting for occasional ongoing support is rarely a poor idea. That the CURSUS Project chose to recruit someone with the necessary skills, and the desire to learn them, was foresighted. It might be more difficult for an established research group, where these skills might not exist, to acquire them. However, all the skills were learnt through online tutorials, supportive user-community mailing lists, and much trial and error. Hence, it could be argued that they could be acquired by any suitably motivated academic in the same way as other discipline-centred skill sets, such as learning Latin.
§ 62 In the case of my research into the conversion of REED volumes to an accessible electronic form, it has served one of its main purposes. Part of the intention, aside from my own curiosity, was to convince the REED Project of both the benefits of such a digitization and the reasons why it should be properly structurally encoded. They have been so interested in this work that an article of mine on the possibilities of web technologies has been included in a volume of articles examining the REED Project (Cummings Forthcoming: 2006). The REED Project is keenly interested in digitization and, as a first step, has made scanned versions of all of their volumes freely available at the Internet Archive (http://www.archive.org/). This material is released under a Creative Commons Attribution, Non-Commercial, Non-Derivative licence, which should be suitable for most academic use of the works. It becomes easy for projects to share their materials with some rights reserved when they "skip the intermediaries" (http://creativecommons.org/learnmore/). The electronic volumes have not been created from the existing REED typesetting files; instead they have been created by scanning the print volumes. They are available in DjVu format, which presents a zoomable image of the page and searching based on the scanned text. While this is not the most useful format—somewhat less functional than PDF—it is a compromise which makes the volumes available to a greater audience. There are utilities to output this in DjVu's XML format. However, this is a line-by-line version of the OCR text, and as such it loses much of the intellectual content (such as the expansion of abbreviations) found in the REED volumes. Nevertheless, REED has indicated that this is a first step, that they understand the benefits of structural encoding, and that they wish to move in this direction eventually. Sometimes the benefit gained by compromising on open standards is the difference between a resource being available or not.
§ 63 The OTA has not received any extra funding for the migration of legacy resources, but it will continue to convert incoming resources where needed to appropriate preservation formats, and occasionally to migrate some of the legacy data which dates from before it adopted this practice. It will continue to apply for funding from other funding bodies. Sometimes, even with the best of intentions, worthwhile projects do not get funded.
§ 64 Although CURSUS, REED and the OTA each take a seemingly very different approach, they are all rooted in the same desire: to enable the use of these resources in a way that will provide greater flexibility for research, and to ensure that they remain available in readable formats for years to come.
Cummings, James C. 2001. Contextual studies of dramatic records in the area around The Wash, c. 1350-1550. Leeds: School of English, University of Leeds. Available at http://purl.org/cummings/research/phd.html.
───. 2005. Scripts and stylesheets for COCOA verse drama to TEI P4 XML conversion. Available at http://purl.org/cummings/research/cocoa2tei/.
───. Forthcoming: 2006. REED and the possibilities of web technologies. In REED in Review, ed. Sally-Beth Maclean and Audrey Douglas. Studies in Early English Drama. Toronto: University of Toronto Press.
Massinger, Philip. 1633. A new way to pay old Debts, Oxford Text Archive: 0603. Electronic edition of A new way to pay old debts: a comoedie as it hath beene often acted at the Phoenix in Drury-Lane, by the Queenes Maiesties servants. London: Printed by E.P. for Henry Seyle. British Museum: Ashley 1123.
Morrison, Alan, Michael Popham, and Karen Wikander. 2000. Guide to good practice 1: Creating and documenting electronic texts. Oxford University: Oxford Text Archive, University of Oxford. Available at http://ota.ox.ac.uk/documents/creating/.
Sperberg-McQueen, C. M. and L. Burnard, eds. 2002. TEI P4: Guidelines for electronic text encoding and interchange. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium. XML Version. Available at http://www.tei-c.org/P4X/.