Workflow and Processing at the Institute for Name-Studies

I started working with the Institute for Name-Studies (henceforth INS) on the Digital Exposure of English Place-Names project in June, and the project has come a long way since then. One of the initial tasks of such a large, collaborative, inter-university project was establishing a workflow between the teams. The workflow has been organised into, and subsequently labelled as, ‘stages’: each volume of the Survey goes through eight stages before it is ready for the website. Here at the INS we play a crucial role in quality control at each stage of the project, and I will outline our procedures and outcomes below.

            Our first principal role at the INS is to ‘process’ the volumes so that they can go on to the next stages, ensuring that the material will be searchable once it is online. This means that historical forms that have been abbreviated in the volumes need to be accurately reconstructed, and relevant footnotes copied into the text. As digitisation began in full, we soon realised that it was easier to accomplish these tasks in a Microsoft Word document than by using Oxygen to edit the volume in XML. Once the OCR process has been completed by CDDA, we are sent a Word file of each volume.

            At this stage (known to the team as Stage 1 INS) we check the high-level tagging structure for the entire volume and make any necessary amendments, expand the abbreviated historical forms (or shortforms, as we call them), and insert footnotes or information from footnotes where applicable. Each volume has its own peculiarities, and different editors use different styles and conventions. Learning and working with these conventions has posed a number of unique problems, but it has also been very rewarding: we have become familiar with the practices of past editors and now better understand the Survey and its history.

            The shortforms in particular are sometimes problematic. In general, three types of shortform are used throughout the Survey: those indicated by one or more hyphens, those in brackets, and those where distinguishing words are separated by commas. Hyphenated shortforms are by far the most common, and in general they pose very few problems. Bracketed shortforms are commonly used to indicate either a prefix or a suffix to the name; different editors have taken different approaches to these, however. The editor A. H. Smith, for example, used bracketed shortforms in at least seven different ways.
The final type is indicated by commas separating the qualifying elements from the generic, e.g. Great, Little Field, where the names indicated are Great Field and Little Field. This convention was used in some volumes, both early and late, but not in all. So far, of the 30 volumes processed through Stage 1 INS, we have expanded a total of 37,531 shortforms. We intend to use crowdsourcing to help us complete this element of the project, and a platform for this is currently being designed.
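To make the three conventions concrete, they can be sketched in code. The rules below are illustrative simplifications only, not the Survey's actual editorial logic: the hyphen rule is a naive length-based substitution, the bracketed rule collapses A. H. Smith's many readings into a single one, and the form Estone/-tona is a hypothetical example, not drawn from a volume.

```python
import re

def expand_hyphenated(previous: str, short: str) -> str:
    """'-tona' after the hypothetical full form 'Estone' -> 'Estona'.
    Simplified rule: the hyphen stands in for the opening of the
    preceding full form (real volumes need editor-specific handling)."""
    suffix = short.lstrip("-")
    return previous[: max(0, len(previous) - len(suffix))] + suffix

def expand_bracketed(entry: str) -> list[str]:
    """One possible reading only: '(Great) Field' -> ['Field', 'Great Field'].
    A. H. Smith alone used bracketed shortforms in at least seven ways,
    so real processing needs per-editor rules."""
    m = re.match(r"\((.+?)\)\s*(.+)$", entry)
    if m:  # bracketed prefix
        optional, rest = m.groups()
        return [rest, f"{optional} {rest}"]
    m = re.match(r"(.+?)\s*\((.+)\)$", entry)
    if m:  # bracketed suffix
        rest, optional = m.groups()
        return [rest, f"{rest} {optional}"]
    return [entry]

def expand_comma(entry: str) -> list[str]:
    """Comma convention: 'Great, Little Field' -> ['Great Field', 'Little Field'].
    Each qualifier shares the generic carried by the final item."""
    parts = [p.strip() for p in entry.split(",")]
    generic = parts[-1].split()[-1]  # shared generic, e.g. 'Field'
    return [f"{q} {generic}" for q in parts[:-1]] + [parts[-1]]
```

Even in this toy form, the asymmetry between the types is visible: the comma convention is mechanically expandable, whereas the hyphenated and bracketed types depend on context and editorial habit, which is part of why crowdsourcing is attractive for this element of the project.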

            Once we have finished this stage at the INS, the Word file and a spreadsheet with comments are sent out to the team, and the file is then run through the first visualisation process. This is sent back to us at the INS, where we check the file for any line-break errors and anything else we may not have spotted during the first stage. These issues are then noted and sent back to the team. The LTG team at Edinburgh then runs the file through the detailed tagging process. A visualisation of this file and the accompanying XML is sent back to us at the INS, where we check detailed samples of parishes for recurring issues and, where possible, try to discern why these were not captured. The particular issues that we look out for include any problems with names, historical forms and sources, cross-references, glosses, language and etymology. Since May, the LTG tool used in the tagging process has been refined to the point where many of the issues noted above now occur only infrequently.
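As an illustration of the kind of line-break error looked for at this stage, a simple heuristic can flag words split across OCR line breaks. This is a hypothetical sketch, not the team's actual tooling, which relies on careful checking by eye.

```python
import re

def flag_linebreak_errors(text: str) -> list[str]:
    """Flag lines ending in a lowercase fragment plus hyphen, a common
    symptom of a word split across an OCR line break.
    Illustrative heuristic only; it will also flag genuine hyphens."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if re.search(r"[a-z]-\s*$", line):
            hits.append(f"line {lineno}: {line.rstrip()!r}")
    return hits

# e.g. flag_linebreak_errors("the forms\nwere expan-\nded here")
# flags line 2, where 'expanded' has been split by the OCR
```

A heuristic like this narrows the search, but each flag still needs a human judgement, since legitimate hyphenated shortforms end in exactly the same way.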

            At this stage, if consistent errors are found, the file is sent back to LTG for refinement; if it passes, it is sent to KCL to begin the MADs process. When a volume has been through MADs, a visualisation of the file is generated and sent back to the INS, where we check it to ensure that all the Gazetteer elements are being captured correctly.

            This, in brief, is an overview of the work that we do at the INS.

This entry was posted in research.
