We’re back

Our many followers on this blog will be pleased to know that we’re back! After some technical problems that took a while to resolve, the blog has now returned.

During the downtime there have been some significant developments for the DEEP project, not least that 21 volumes have now been completed, with others coming thick and fast. Soon we’ll be beta-testing the new gazetteer website. Let us know if you’d like to help test the site out. We’re pleased with the interface and functionality but are sure we can do better.

In the meantime we’ll be updating you on the progress made in more detail.

Paul


Posted in Uncategorized

Workflow and Processing at the Institute for Name-Studies

I started working with the Institute for Name-Studies (henceforth INS) on the Digital Exposure of English Place-Names in June, and the project has come a long way since then. One of the initial tasks of such a large collaborative, inter-university project was establishing a workflow between the teams. This workflow has been organised into what we label ‘stages’: each volume of the Survey goes through eight stages before it is ready for the website. Here at the INS we play a crucial role in quality control at each stage of the project, and I will outline our procedures and outcomes.

Our first principal role at the INS is to ‘process’ the volumes so that they can go on to the next stages, ensuring that the material will be searchable once it is online. This means that historical forms that have been abbreviated in the volume need to be accurately reconstructed, and relevant footnotes copied and pasted into the text. As the digitisation process began in full, we soon realised that it was easier to accomplish these tasks in a Microsoft Word document than by using Oxygen to edit the volume in XML. Once the OCRing process has been completed by CDDA, we are sent a Word file of each volume. At this stage (known to the team as Stage 1 INS) we go through and check the high-level tagging structure for the entire volume and make any necessary amendments, expand the abbreviated historical forms (or ‘shortforms’ as we call them), and insert footnotes or information from footnotes where applicable.

Each volume has its own peculiarities, and different editors use different styles and conventions. Learning and working with these conventions has posed a number of unique problems, but it has also been very rewarding: we have become familiar with the practices of past editors and now better understand the Survey and its history. The shortforms in particular are sometimes problematic. In general there are three types of shortform used throughout the Survey:

  • Hyphenated shortforms are by far the most prolific, and in general they pose very few problems.
  • Bracketed shortforms are commonly used to indicate either a prefix or a suffix to the name. Different editors have taken different approaches to these, however: the editor A. H. Smith, for example, had at least seven different interpretations for a bracketed shortform.
  • The final type uses commas to separate the qualifying elements from the generic, e.g. Great, Little Field, where the names indicated are Great Field and Little Field. This convention was used in some volumes, both early and late, but not all.

So far, of the 30 volumes processed through Stage 1 INS, we have expanded a total of 37,531 shortforms. We intend to use crowdsourcing to help us complete this element of the project, and a platform for this is currently being designed.
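To illustrate the comma convention, the expansion can be sketched in a few lines of Python. This is a hypothetical helper, not the project’s actual tooling, and it assumes the simplest case: the last comma-separated part ends with the shared generic.

```python
def expand_comma_shortform(shortform: str) -> list[str]:
    """Expand a comma-style shortform such as 'Great, Little Field'
    into the full names it stands for."""
    parts = [p.strip() for p in shortform.split(",")]
    *qualifiers, last = parts
    generic = last.split()[-1]            # shared generic, e.g. 'Field'
    names = [f"{q} {generic}" for q in qualifiers]
    names.append(last)                    # the final part is already complete
    return names
```

So `expand_comma_shortform("Great, Little Field")` yields `["Great Field", "Little Field"]`; a form with no comma passes through unchanged.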

Once we have finished this stage at the INS, the Word file and a spreadsheet with comments are sent out to the team, and the file is then run through the first visualisation process. This is sent back to us at the INS, where we check the file for line-break errors and anything else we may not have spotted during the first stage. These issues are noted and sent back to the team. Next, LTG at Edinburgh runs the file through the detailed tagging process. A visualisation of this file and the accompanying XML is then sent back to us at the INS, where we check detailed samples of parishes for recurring issues and, where possible, try to discern why these were not captured. The particular types of issue that we look out for include problems with names, historic forms and sources, cross-references, glosses, language and etymology. Since May, the LTG tool used in the tagging process has been refined to the point where many of these issues now occur very infrequently.

At this stage, if consistent errors are found, the file is sent back to LTG for refinement; if it passes, it is sent to KCL to begin the MADs process. When a volume has been through MADs, a visualisation of the file is generated and sent back to the INS, where we check it to ensure that all the gazetteer elements are being captured correctly.

This, in brief, is an overview of the work that we do at the INS.

Posted in research

March 2012 Advisory Board

The minutes of the project Advisory Board meeting held in March 2012 are now available. Click here to view them.

Posted in Uncategorized

Checking the OCR samples

Having just proof-read OCR samples from six county survey volumes (my first task on this project), I have been astonished at the sheer quality of the OCR output. Scanning technology has certainly come a long way since I first saw grainy scanned texts over 20 years ago. The samples come from early volumes, with all the attendant issues of unavailable fonts and characters and uneven inking on the page (which makes it hard to distinguish bold from non-bold text). The OCR technology seems to have dealt very well with these difficulties and faithfully reproduced pretty much everything on the page – to the point where some things (such as the printers’ marks at the foot of some pages) will have to be removed. The main issues arising are missing macrons on place-name elements, the treatment of footnotes, and the question of how addenda/corrigenda will be incorporated. These issues are unrelated to the OCR process and will require editorial decisions.

Posted in Uncategorized

Update on the automatic processing

We are making steady progress on building the automatic processing pipeline for converting from OCR output to heavily annotated XML texts. The pipeline makes successive passes over an EPNS volume with each stage adding further layers of annotation.

The first step converts from a Word file to XML and retains the high level structural mark-up that CDDA added by hand. In the input, each line is a separate paragraph and this step considers sequences of lines and creates proper paragraphs around running text, reversing line-end hyphenation at the same time.
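A minimal sketch of this paragraph-building pass might look like the following. This is illustrative only: the real pipeline also preserves CDDA’s structural mark-up, and a genuine hyphen at line end would need something like a wordlist check before the halves are joined.

```python
def build_paragraphs(lines: list[str]) -> list[str]:
    """Join OCR output lines into paragraphs.

    A blank line ends a paragraph; a line ending in '-' is treated as a
    word split across lines, so the hyphen is removed and the halves are
    joined directly (reversing line-end hyphenation)."""
    paragraphs: list[str] = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:                       # blank line: paragraph break
            if current:
                paragraphs.append(current)
            current = ""
        elif current.endswith("-"):        # reverse line-end hyphenation
            current = current[:-1] + line
        else:
            current = f"{current} {line}".strip()
    if current:
        paragraphs.append(current)
    return paragraphs
```

For example, the lines `["The town-", "ship of Bramhall", "", "Next paragraph"]` come out as the two paragraphs `"The township of Bramhall"` and `"Next paragraph"`.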

The second step identifies word and punctuation tokens, retaining information about font style and weight as attributes on the token elements. Sentence splitting is applied in running text paragraphs at this stage.
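In outline, tokenisation of this kind might be sketched as follows. The dict-based token representation is purely illustrative; the pipeline’s actual XML element names and attributes may differ.

```python
import re

def tokenise(text: str, bold: bool = False, italic: bool = False) -> list[dict]:
    """Split running text into word and punctuation tokens, carrying
    font weight and style along as attributes on each token."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok = match.group(0)
        tokens.append({
            "text": tok,
            # alphanumeric start -> word token, otherwise punctuation
            "type": "word" if tok[0].isalnum() or tok[0] == "_" else "punct",
            "bold": bold,
            "italic": italic,
        })
    return tokens
```

Calling `tokenise("Bromehale 1426 Plea,", italic=True)` produces four tokens, the last of them a punctuation token, each flagged as italic.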

In the third step some of the higher level elements are segmented into smaller parts: field and street name sections are split into their individual members and long paragraphs containing minor place names, some with attestation or etymological information, some without, are split into individual entries. The name of each place is computed at this point and stored in an attribute on the relevant entry. While most of these names are straightforwardly recognisable, several variants can be compressed in quite complex ways (e.g. Woodhouse, (-End & -Green); The Dighills, Dighill Brook & Wood; Fanshawe (Lane), Fanshawe Brook (Fm); Whirleybarn, Whirley Cottages, Grove & Rd). At this stage just the extent of the name is found, leaving later stages to expand the names into all their variants.

The fourth step does finer-grained segmentation of the information associated with places. Attestations are place name forms linked to dates and sources in which they were attested. For example, for the township of Bramhall in the parish of Stockport, an earlier form is shown like this:

Bromehale 1426 Plea, 1433 ChRR

meaning that the form Bromehale was recorded in 1426 in the source abbreviated ‘Plea’ (Plea Rolls of the County of Chester) and in 1433 in the source abbreviated ‘ChRR’ (Calendar of the Chester Recognizance Rolls). The etymology of Bramhall is glossed as ‘Broom-nook’ from the elements brōm and halh. Each of these pieces of information (form, date, source, gloss, place name element) is identified and marked up with an appropriate XML element. To identify the sources we use a lexicon derived from the abbreviations list for the relevant county and, once identified, we add the id of the source to provide a link to the full form of the abbreviation. For each place name element, we query a local copy of the Key to English Place Names database and add the relevant database id if it is found.
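A much-simplified parse of such an attestation line can be sketched like this. Real entries also carry date ranges, editorial brackets and multi-word source abbreviations, all of which this hypothetical helper ignores.

```python
import re

# a date followed by a source abbreviation, e.g. '1426 Plea'
ATTESTATION = re.compile(r"(?P<date>\d{3,4})\s+(?P<source>[A-Za-z][\w.]*)")

def parse_attestation(line: str) -> dict:
    """Split an attestation line into the recorded form and its
    (date, source) pairs."""
    form = line.split()[0]
    attested = [(int(m.group("date")), m.group("source"))
                for m in ATTESTATION.finditer(line)]
    return {"form": form, "attested": attested}
```

On the example above, `parse_attestation("Bromehale 1426 Plea, 1433 ChRR")` returns `{"form": "Bromehale", "attested": [(1426, "Plea"), (1433, "ChRR")]}`, ready to be serialised as XML elements.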

We are still working on subsequent steps, with two main things still to do. The first is expansion of shorthand ways of recording alternative forms, e.g. expanding Fallibro(o)me, -y-, -bro(m) to give the forms Fallibrome, Fallibroome, Fallybrome, Fallibrom, Fallibro. The second outstanding step is georeferencing the places. Here we need to convert any old OS map references provided by EPNS to lat/long and we will also georeference the parishes and major place names against the Unlock gazetteer using the Edinburgh Geoparser. After verification of the georeferences, it will only take a small transformation to create entries in the DEEP historical gazetteer.
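Part of this expansion – the bracketed optional letters – can be sketched with a small recursive helper. This is a hypothetical illustration only; the hyphenated infix variants such as -y- need separate handling.

```python
import re

def expand_optional(form: str) -> list[str]:
    """Expand bracketed optional letters, e.g. 'Fallibro(o)me' ->
    ['Fallibrome', 'Fallibroome']; recursion handles forms with more
    than one bracketed group."""
    m = re.search(r"\(([^)]*)\)", form)
    if m is None:
        return [form]                      # nothing left to expand
    without = form[:m.start()] + form[m.end():]
    with_letters = form[:m.start()] + m.group(1) + form[m.end():]
    return expand_optional(without) + expand_optional(with_letters)
```

Thus `expand_optional("Fallibro(o)me")` gives `["Fallibrome", "Fallibroome"]`, and `expand_optional("Fallibro(m)")` gives `["Fallibro", "Fallibrom"]`.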

Posted in Uncategorized

DEEP in numbers

Following my talk at yesterday’s GeoCulture seminar in London, I thought I’d post some figures about the SEPN content which DEEP is digitising:

  • 80+ years of scholarship
  • 32 English counties
  • 86 volumes
  • 6157 elements
  • 30,517 pages
  • c. 4,000,000 individual place-name forms
  • ??? bibliographic references (we will know soon – it’s quite a lot)
Posted in Uncategorized

Digitisation as research

With the REF on the horizon, most academics are currently concerned with matters of impact and academic recognition. How to get academic recognition for a digitisation project, such as those funded under the JISC eContent programmes, is therefore an important question. In order to receive JISC funding to digitise content, one has, of course, to demonstrate the academic value of the resource to be digitised, and to explain how making it available digitally will increase that value. The impact and value of digitisation outputs themselves, and how they fit into peer-review structures, have been the subject of previous studies, but the question of getting credit for undertaking digitisation itself is less clear. This can cause problems when dealing with outside bodies concerned with the review or evaluation of research, or even with one’s own institution. In some cases, for example, digitisation activities might be interpreted as software development or IT support, thus preventing those involved from getting academic credit. How this classification is made varies from HEI to HEI: in some cases an email from the PI or Co-I confirming that the project is ‘research’ will suffice; in others there is a questionnaire or some other pro forma. However they classify activities, most Higher Education Institutions adopt the principles of the Frascati Manual’s definition of research, or something very similar. These break research down into three headings:

  • Basic research is experimental or theoretical work undertaken primarily to acquire new knowledge of the underlying foundation of phenomena and observable facts, without any particular application or use in view.
  • Applied research is also original investigation undertaken in order to acquire new knowledge. It is, however, directed primarily towards a specific practical aim or objective.
  • Experimental development is systematic work, drawing on existing knowledge gained from research and/or practical experience, which is directed to producing new materials, products or devices, to installing new processes, systems and services, or to improving substantially those already produced or installed.

Most academic digitisation work is likely to fall into the third category, provided that the making available of digital resources is accompanied by some form of enhancement, such as machine-readable mark-up or a crowd-sourcing platform. This is especially so if it can be shown that the enhancement draws directly on the project team’s experience and expertise. Certainly in the context of the DEEP project, there are complicated questions of data structure, interpretation and mark-up, the exploration of which would appear as research questions to most scholars, and deserving of recognition as such. Undoubtedly they require the broad interdisciplinary skill set of all the partners.

Projects needing to make this argument may wish to consider the following suggestions:

1. Ensure that the research question or questions your resource will address are clearly articulated, and that you have to hand a clear statement describing the unique knowledge needed to make it digitally available in the way you have chosen.

2. Refer to the Frascati guidelines, and any relevant institutional definitions of research and related activities.

3. Ensure you are talking to the right person. It may be the case that staff charged with classifying activities are not familiar with digitisation. This is especially so in departments or schools with little experience of such projects. In such cases, the decision on whether to classify the project as research may well need to be taken at a higher level than normal.

Both the Centre for Data Digitisation and Research at QUB and the Centre for e-Research in the Dept. of Digital Humanities at KCL have extensive experience in dealing with such projects, and would be happy to offer discussion and advice to any project which needs to make the argument that their work constitutes research.

Posted in research

Consortium Agreement

Happy to report that the DEEP project consortium agreement between KCL, Queen’s, Edinburgh (both LTG and EDINA) and Nottingham has been agreed and signed.

Posted in Uncategorized

Starting digitisation

The DEEP project started on time in November last year. Our project plan has been finalised, and will shortly be available from our page on the JISC website.

As anticipated, the Survey of English Place-names (SEPN) is a complex and fascinating document. Produced by the English Place-Name Society (EPNS), the SEPN is a true community effort. Its 86 volumes document the names of some 40 English counties, and have been compiled by different place-name scholars over the years. Thus, a succession of different people have moulded the text itself to fit and reflect England’s ancient and rich toponymic landscape.

While this provides an unrivalled resource for the place-name scholar, the historian, the geographer and the linguist, it makes digitising the Survey a challenge. Our aim is to put the forms into a structured gazetteer, but the structure varies from county to county. The basic hierarchy goes from large units, such as counties and hundreds, to smaller units, such as parishes, townships, settlements and minor names. Some conventions persist: parish names appear as headings, for example, followed by townships and settlements. But there are inevitable exceptions, which makes tagging these sections of text complex – we do not wish to impose artificial structures on anomalous portions of text, since they will all be anomalous for a reason.

OCRing the text is the responsibility of CDDA. This process has thrown up problems: in some cases, for example, matching Anglo-Saxon characters to their supported Unicode equivalents requires expert input from the team at Nottingham. Sometimes Anglo-Saxon characters are simply hard to read due to printing issues; sometimes the problem is that the Unicode assignments themselves need correcting. For instance, a character initially assigned the code point E624 was misread and reassigned 01ED (ǭ).

Cheshire is now complete, and work is underway on Shropshire.

Posted in Uncategorized

Updated project description

Here is our updated description for the DEEP project:

Place-names are not static. They change and evolve over time, in response to the development of language, wars and conquests, shifting administrative boundaries, or simply the vagaries of spelling in the days before dictionaries and atlases. They have complex etymologies derived from different languages, and they mean different things to different communities. Therefore, historical documents and archives, ephemera and sources, contain different spellings (forms) of place-names, depending on their date and context. However – and despite the fact that we now take for granted the ability to search geographic data using web services such as Google Maps and GeoNames.org – there is no gazetteer documenting these historic name forms. Therefore, there is no means of linking or cross-searching the geographic references they contain. In summary, a search using a modern place-name will not currently return results for that name in all its many variant forms. This has resulted in a major underutilisation of electronic resources.

Digitisation, however, offers a solution. In England, the historical development of place-names over time has been systematically surveyed since 1922 by the specialists of the English Place-Name Society (EPNS). Examining an extensive range of documentary sources in local and national archives, and gathering the knowledge of local communities and experts, the EPNS has built up an 86-volume county-by-county survey of England’s place-names – detailing over four million variant forms, from classical sources, through the Anglo-Saxon period, into medieval England and beyond to the modern period. JISC’s Digital Exposure of English Place-Names (DEEP) project will digitise all these forms, and make them available as structured data. The corpus will comprise a gazetteer within JISC’s Unlock service, meaning that researchers will be able to cross-query the dataset, and use it to search their own digital documents and databases for any historic place-name form. The gazetteer data will also be made available in structured XML, making it possible to experiment with methods of data mining and visualisation that are not possible with the paper volumes. In addition to the digitisation, a network of experts will be convened to correct and enhance the dataset.

The completed resource will provide a key piece of electronic infrastructure for the discovery, clustering, use and analysis of e-content referenced by place. It will also be an important resource for scholars of place-names, and for scholars in cognate disciplines such as history, linguistics, archaeology and historical geography.

DEEP has had a long gestation period, and is a logical extension of existing work. Its context is the significant investment which JISC has made in various forms of gazetteers and geospatial web services, such as GeoCrosswalk, GeoDigRef, and Unlock. Principally, it grew from Connecting Historical Authorities with Linked data, Contexts and Entities (CHALICE), funded in 2010 under JISC’s Information Environment Programme and led by EDINA. In that exemplar project, the current team carried out a full pilot demonstrator: it digitised the place-names of Cheshire and a sample of those of Shropshire, extracted place-name, attestation and chronological data using the Edinburgh geoparser, and generated a gazetteer of historic place-names to link documents and authority files in Linked Data form. This proved the concept now being rolled out under DEEP but, as an exemplar, was constrained by limitations of time and resources. As a result, methodological challenges have been resolved and the team has a proven track record of working together.

 

Posted in Uncategorized