Stuart Dunn (of the Centre for e-Research at King's College London) organised a stimulating workshop on the Edinburgh Geoparser. We discussed the work done extracting and mapping location references in several recently digitised archives (including the Stormont Papers, debates from the Stormont Parliament which ran in Northern Ireland from 1921 to 1972).
Paul Ell talked about the role of the Centre for Digitisation and Data Analysis in Belfast in accelerating the “digital deluge” – over the last 3 or 4 years they have seen a dramatic decrease in digitisation cost, accompanied by an increase in quality and verifiability of the results.
However, as Paul commented later in the day, research funding invested in “development of digital resources has not followed through with a step change in scholarship”. So the work by the Language Technology Group on the Edinburgh Geoparser, and by other research groups such as the National Centre for Text Mining in Manchester, becomes essential to “interrogate [digital archives] in different ways”, including spatially.
“Changing an image into knowledge” is the aim; translating an image into machine-readable text is only the beginning of this process.
There was mention of a Westminster-funded project to digitise and extract reference data from historic Hansards (parliamentary proceedings) – it would be a kind of “They Worked For You”. I found this prototype site which looks inactive and the source data from the Hansard archives – perhaps this is a new effort at exploiting the data-richness in the archives.
The place search service used was GeoCrossWalk, the predecessor to Unlock Places. The Edinburgh Geoparser, written by the Language Technology Group in the School of Informatics, sits behind the Unlock Text geo-text-mining service, which uses the Places service to search for places across gazetteers.
Claire Grover spoke about LTG’s work on event extraction, making it clear that the geoparser does a subset of what LTG’s full toolset is capable of. LTG has some work in development extracting events from textual metadata associated with news imagery in the NewsFilmOnline archive.
This includes some automated parsing of relative time expressions, like “last Tuesday”, “next year”, grounding events against a timeline and connecting them with action words in the text. I’m really looking forward to seeing the results of this – mostly because “Unlock Time” will be a great name for an online service.
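To make the idea of grounding a relative time expression concrete, here is a minimal sketch in Python. It is not LTG's method, just an illustration under my own assumptions: the function name resolve_relative is hypothetical, only "last"/"next" followed by a weekday or "year" are handled, and "last Tuesday" is read as the most recent Tuesday strictly before the anchor date (for a news item, the anchor would be the publication date).

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_relative(expr, anchor):
    """Ground a relative time expression against an anchor date.

    Hypothetical helper: returns a (start, end) pair of dates,
    or None if the expression is not one of the few patterns handled.
    """
    match = re.fullmatch(r"(last|next) (\w+)", expr.strip().lower())
    if not match:
        return None
    direction, unit = match.groups()
    sign = -1 if direction == "last" else 1

    if unit == "year":
        year = anchor.year + sign
        return date(year, 1, 1), date(year, 12, 31)

    if unit in WEEKDAYS:
        target = WEEKDAYS.index(unit)
        if sign < 0:
            # Days back to the previous occurrence (1..7, never the anchor itself)
            delta = (anchor.weekday() - target) % 7 or 7
        else:
            # Days forward to the next occurrence (1..7, never the anchor itself)
            delta = (target - anchor.weekday()) % 7 or 7
        day = anchor + timedelta(days=sign * delta)
        return day, day

    return None
```

Calling resolve_relative("last Tuesday", date(2010, 3, 12)) with a Friday as the anchor gives 9 March 2010 for both start and end. A real system also has to decide which date to anchor against and link the result to the action words in the text, which this sketch ignores.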
The big takeaway for me was the idea of searching and linking value implicit in the non-narrative parts of digitised works – indexes, footnotes, lists of participants, tables of statistics. If the OCR techniques are smart enough to (mostly) automatically drop this reference data into spreadsheets, without much more effort it can become Linked Data, pointing back to passages in the text at paragraph or sentence level.
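As a rough sketch of that last step, here is how rows from such a spreadsheet might be turned into Linked Data triples that point back at a paragraph. Everything specific here is my assumption: the column names (id, name, volume, paragraph), the example.org namespace and the function rows_to_ntriples are all hypothetical, and the ids are assumed to be safe to use in a URI. Only the rdfs:label and dcterms:isReferencedBy properties are real vocabulary terms.

```python
import csv
import io

BASE = "http://example.org/archive/"  # hypothetical namespace
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
REFERENCED_BY = "http://purl.org/dc/terms/isReferencedBy"

def rows_to_ntriples(csv_text):
    """Turn index rows (assumed columns: id, name, volume, paragraph)
    into N-Triples lines linking each entry to a paragraph in the text."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}entry/{row['id']}>"
        # Escape backslashes and quotes so the literal stays valid
        label = row["name"].replace("\\", "\\\\").replace('"', '\\"')
        passage = f"<{BASE}{row['volume']}#para-{row['paragraph']}>"
        lines.append(f'{subject} <{LABEL}> "{label}" .')
        lines.append(f"{subject} <{REFERENCED_BY}> {passage} .")
    return lines
```

Each index entry becomes a resource with a label and a link to the passage that mentions it, so a search hit on the index can lead straight back into the narrative text.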
At several points during the workshop there were pleas for more historical gazetteers of placename and location information, available for re-use outside a pure research context (such as enriching the archives of the Northern Irish assembly). Claire raised the intriguing possibility of generating names for a gazetteer, or placename authority files, automatically as a result of the geo-text-parsing process – “the authority file is in effect derived from the sources”.
At this point the idea of a gazetteer goes back beyond simply place references, to include references to people, to concepts, and to events. One could begin to call this an ontology, but for some that has a very specific technical meaning.
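A minimal sketch of what "derived from the sources" could look like, assuming the parser hands back a list of mentions. The record shape (text, kind, doc) and the function build_authority_file are hypothetical and are not the Edinburgh Geoparser's actual output format. Mentions are grouped under a normalised name, keeping the spelling variants and the documents each name was found in.

```python
from collections import defaultdict

def build_authority_file(mentions):
    """Group extracted mentions into authority-file entries.

    Each mention is assumed to be a dict with 'text' (the string as found),
    'kind' (e.g. 'place', 'person', 'event') and 'doc' (a source identifier).
    """
    entries = defaultdict(
        lambda: {"variants": set(), "sources": set(), "count": 0}
    )
    for mention in mentions:
        # Normalise case and whitespace so spelling variants share one entry
        name = " ".join(mention["text"].lower().split())
        entry = entries[(mention["kind"], name)]
        entry["variants"].add(mention["text"].strip())
        entry["sources"].add(mention["doc"])
        entry["count"] += 1
    return dict(entries)
```

Keying on kind as well as name lets the same structure hold people, concepts and events alongside places. Deciding when two different names refer to the same thing is the hard part, and this sketch does not attempt it.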
The closing session discussed research challenges, including the challenge of getting support for further work. On the one hand we have scholarly infrastructure, on the other scholarly applications. There is a breadth of disciplines that can benefit from infrastructure, but they need applications; applications may be developed for small research niches, but have as yet unknown benefit for researchers looking at the same places or times in different ways.