Linking Placename Authorities

April 9, 2010


Putting together a proposal for JISC call 02/10 based on a suggestion from Paul Ell at CDDA in Belfast. Why post it here? I think there’s value in working on these things in a more public way, and I’d like to know who else would find the work useful.

Summary

Generating a gazetteer of historic UK placenames, linked to documents and authority files in Linked Data form. Both working with existing placename authority files, and generating new authority files by extracting geographic names from text documents. Using the Edinburgh Geoparser to “georesolve” placenames and link them to widely-used geographic entities on the Linked Data web.

Background

GeoDigRef was a JISC project to extract references to people and places from several very large digitised collections, in order to make them easier to search; as part of it, the Edinburgh Geoparser was adapted to extract place references from collections at that scale.

One roadblock in this and other projects has been the lack of an open historic placename gazetteer for the UK.

Placenames in authority files, and placenames text-mined from documents, can be turned into geographic links that connect items in collections with each other and with the Linked Data web; a historic gazetteer for the UK can be built as a byproduct.

Proposal

Firstly, working with placename authority files from existing collections, starting with the digitised volumes of the English Place Name Survey as a basis.

Where place names are found, they can be linked to the corresponding Linked Data entity in geonames.org, the motherlode of place name links on the Linked Data web, using the georesolver component of the Edinburgh Geoparser.

Secondly, using the geoparser to extract placename references from documents and using those placenames to seed an authority file, which can then be resolved in the same way.

An open source web-based tool will help users link places to one another, remove false positives found by the geoparser, and publish the results as RDF using an open data license.
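As a rough illustration of the kind of RDF such a tool might publish (a sketch, not a specification of the tool), here is a minimal example using Python’s rdflib. The epns.example.org URI is a made-up placeholder for an English Place Name Survey entry, and the geonames URI shown is simply the one for modern London, used only to show the shape of a link.

```python
# Minimal sketch of publishing a placename link as RDF with rdflib.
# The epns.example.org URI is a hypothetical placeholder; the geonames URI
# is the entity for modern London, used here only to illustrate the link shape.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import OWL, RDFS

g = Graph()
epns_place = URIRef("http://epns.example.org/place/london")    # hypothetical authority-file URI
geonames_place = URIRef("http://sws.geonames.org/2643743/")    # geonames.org entity for London

g.add((epns_place, RDFS.label, Literal("London")))
g.add((epns_place, OWL.sameAs, geonames_place))

print(g.serialize(format="turtle"))
```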

Historic names will be imported back into the Unlock place search service.

Context

This will leave behind a toolset for others to use, as well as creating new reference data.

Building on work done at the Open Knowledge Foundation to convert MARC/MADS bibliographic resources to RDF and add geographic links.

Re-using existing digitised resources from CDDA to help make them more discoverable, and to provide a path in for researchers.

Geonames.org has some historic coverage, but it is hit-and-miss (e.g. “London” has “Londinium” as an alternate name, but at the contemporary location). The new OS OpenData sources are all contemporary.

A placename found in a text may not be found in any gazetteer. The more places that are correctly located, the higher the likelihood that other places mentioned in the same document will also be correctly located. More historic coverage means better georeferencing for more archival collections.


The Edinburgh Geoparser and the Stormont Hansards

March 4, 2010

Stuart Dunn (of the Centre for e-Research at King’s College London) organised a stimulating workshop on the Edinburgh Geoparser. We discussed the work done extracting and mapping location references in several recently digitised archives (including the Stormont Papers, debates from the Stormont Parliament which sat in Northern Ireland from 1921 to 1972).

Paul Ell talked about the role of the Centre for Digitisation and Data Analysis in Belfast in accelerating the “digital deluge” – over the last 3 or 4 years they have seen a dramatic decrease in digitisation cost, accompanied by an increase in quality and verifiability of the results.

However, as Paul commented later in the day, research funding invested in “development of digital resources has not followed through with a step change in scholarship”. So the work by the Language Technology Group on the Edinburgh Geoparser, and by other research groups such as the National Centre for Text Mining in Manchester, becomes essential to “interrogate [digital archives] in different ways”, including spatially.

“Changing an image into knowledge”: translating an image into a machine-readable text is only the beginning of this process.

There was mention of a Westminster-funded project to digitise and extract reference data from historic Hansards (parliamentary proceedings) – it would be a kind of “They Worked For You”. I found this prototype site, which looks inactive, and the source data from the Hansard archives – perhaps this is a new effort at exploiting the data-richness of the archives.

The place search service used was GeoCrossWalk, the predecessor to Unlock Places. The Edinburgh Geoparser, written by the Language Technology Group in the School of Informatics, sits behind the Unlock Text geo-text-mining service, which uses the Places service to search for places across gazetteers.

Claire Grover spoke about LTG’s work on event extraction, making it clear that the geoparser does a subset of what LTG’s full toolset is capable of. LTG has some work in development extracting events from textual metadata associated with news imagery in the NewsFilmOnline archive.

This includes some automated parsing of relative time expressions, like “last Tuesday”, “next year”, grounding events against a timeline and connecting them with action words in the text. I’m really looking forward to seeing the results of this – mostly because “Unlock Time” will be a great name for an online service.
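To make the idea of “grounding” concrete, here is a toy sketch of resolving one relative expression, “last Tuesday”, against a known document date. This is not LTG’s method, just an illustration of what grounding a relative time expression means.

```python
# Toy illustration of grounding a relative time expression against a document date.
# This is not LTG's approach, just a sketch of the idea.
from datetime import date, timedelta

def last_weekday(doc_date, weekday):
    """Most recent occurrence of the given weekday (0=Monday .. 6=Sunday)
    strictly before doc_date."""
    days_back = (doc_date.weekday() - weekday - 1) % 7 + 1
    return doc_date - timedelta(days=days_back)

# "last Tuesday" relative to a document dated 4 March 2010 -> 2010-03-02
print(last_weekday(date(2010, 3, 4), weekday=1))
```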

The big takeaway for me was the idea of searching and linking value implicit in the non-narrative parts of digitised works – indexes, footnotes, lists of participants, tables of statistics. If the OCR techniques are smart enough to (mostly) automatically drop this reference data into spreadsheets, without much more effort it can become Linked Data, pointing back to passages in the text at paragraph or sentence level.

At several points during the workshop there were pleas for fuller historical gazetteers of placename and location information, available for re-use outside a pure research context (such as enriching the archives of the Northern Irish assembly). Claire raised the intriguing possibility of generating names for a gazetteer, or placename authority files, automatically as a result of the geo-text-parsing process – “the authority file is in effect derived from the sources”.

At this point the idea of a gazetteer goes back beyond simply place references, to include references to people, to concepts, and to events. One could begin to call this an ontology, but for some that has a very specific technical meaning.

The closing session discussed research challenges, including the challenge of getting support for further work. On the one hand we have scholarly infrastructure, on the other scholarly applications. There is a breadth of disciplines that can benefit from infrastructure, but they need applications; applications may be developed for small research niches, yet have as-yet-unknown benefits for researchers looking at the same places or times in different ways.

Links:
Embedding GeoCrossWalk final report (PDF)


A very long list of census placenames

February 9, 2010

Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910096 lines long.

[Image: a corner of a page of a census record]

Many placenames have the name of a containing county, though some don’t. The data is full of errors: mistakes in the original records, mis-heard names, and maybe errors in transcription.

This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?

I made a start at this over the weekend, because I also wanted an excuse to play with the Redis NoSQL data store.

To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places: rather than searching for exact locations at this stage, we look for names that are known to exist as places. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.
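In outline, the grounding pass looked something like the sketch below. The endpoint URL, parameter names and response shape are assumptions from memory rather than quotes from the Unlock Places documentation, and the input filename is a placeholder.

```python
# Rough sketch of the grounding pass against Unlock Places.
# The endpoint URL, parameters and response shape are assumptions, not taken
# from the API documentation; the input filename is a placeholder.
import requests

UNLOCK_WS = "http://unlock.edina.ac.uk/ws/closestMatchSearch"  # assumed endpoint

def ground(name):
    resp = requests.get(UNLOCK_WS, params={"name": name,
                                            "gazetteer": "geonames",  # assumed parameter
                                            "format": "json"})
    resp.raise_for_status()
    return resp.json().get("features", [])  # assumed response shape

known, unknown = [], []
for line in open("unique_placenames.txt"):
    name = line.strip()
    (known if ground(name) else unknown).append(name)
```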

unique placenames: 667513
known by geonames: 34180
unknown by geonames: 633333

We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetMap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.

Next, each of the unlocated placenames is compared to the grounded group of places; if one name is very similar to another (as measured by Levenshtein distance, using a handy Python module), a reference is stored asserting that one place is the sameAs another.

Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. It’s hard to say at this stage how many will be in some kind of error (Easton matching Aston) – 1 in 20, or hopefully many fewer.

place:sameas:WELBOURN : place:WELBURN
place:sameas:WELBOURY : place:WELBURY
place:sameas:ALSHORNE : place:ASHORNE
place:sameas:PHURLIGH : place:PURLEIGH
place:sameas:LANGATHN : place:LLANGATHEN
place:sameas:WIGISTON : place:WIGSTON
place:sameas:ALSHORPE : place:ASHOPE
place:sameas:PELSCHAM : place:ELSHAM
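A minimal sketch of that similarity pass, using the python-Levenshtein module and Redis and writing keys in the form shown above, might look like the following; the names of the input sets in Redis are assumptions for illustration, not the actual keys used.

```python
# Minimal sketch of the similarity pass: python-Levenshtein plus Redis,
# writing place:sameas:* keys in the form shown above. The "places:grounded"
# and "places:unknown" set names are assumptions, not the actual keys used.
import Levenshtein
import redis

r = redis.Redis()

def best_match(name, grounded, threshold=0.9):
    """Return the most similar grounded name, if it is similar enough."""
    score, match = max((Levenshtein.ratio(name, g), g) for g in grounded)
    return match if score >= threshold else None

grounded = [g.decode() for g in r.smembers("places:grounded")]
for raw in r.smembers("places:unknown"):
    name = raw.decode()
    match = best_match(name, grounded)
    if match:
        r.set("place:sameas:%s" % name, "place:%s" % match)
```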

As a next stage, I plan to run the similarity test again, on the placenames derived from the first pass, with a higher threshold for similarity.

This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetMap’s Nominatim geocoding search service. I should probably write to them and mention this.
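For reference, a single lookup against Nominatim’s public search endpoint could be as simple as the sketch below; bulk runs against the public service are exactly why it would be polite to write to them first.

```python
# Sketch of a single lookup against OpenStreetMap's Nominatim search API.
# Bulk runs against the public service would need their blessing first.
import requests

def nominatim_lookup(name):
    resp = requests.get("https://nominatim.openstreetmap.org/search",
                        params={"q": name, "format": "json", "countrycodes": "gb"},
                        headers={"User-Agent": "census-placenames-experiment"})
    resp.raise_for_status()
    return resp.json()  # list of candidate places with lat/lon and display names

print(nominatim_lookup("Welbourn, Lincolnshire"))
```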

There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim); others are sub-regions or suburbs attached to other placenames, or north/south/east/west prefixes.

What next?

Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?

A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.

We can also look at sound similarity algorithms like soundex and metaphone. There are concerns that this would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
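As a sketch of what those rough-guess suggestions might look like, the jellyfish library (one option among several) provides both Soundex and Metaphone encodings; names encoding to the same value could be shown to the annotator as candidates.

```python
# Rough-guess suggestions by sound, using the jellyfish library (one of
# several options) for Soundex and Metaphone encodings.
import jellyfish

pairs = [("ALSHORNE", "ASHORNE"), ("PHURLIGH", "PURLEIGH"), ("EASTON", "ASTON")]
for a, b in pairs:
    print(a, b,
          "soundex match:", jellyfish.soundex(a) == jellyfish.soundex(b),
          "metaphone match:", jellyfish.metaphone(a) == jellyfish.metaphone(b))
```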

A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).

Records with unknown placenames can be roughly located near the places of related records.

How close is close enough for search? If the record is floating near the street, or the neighbourhood, that it belongs in, is that close enough?

And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?


Thoughts on Unlocking Historical Directories

January 26, 2010

Last week I talked with Evelyn Cornell, of the Historical Directories project at the University of Leicester. The directories are mostly local listings information, trade focused, that pre-date telephone directories. Early ones are commercial ventures, later ones often produced with the involvement of public records offices and postal services. The ones digitised at the library in Leicester cover England and Wales from 1750 to 1919.

This is a rich resource for historic social analysis, with lots of detail about locations and what happened in them. On the surface, the directories have a lot of research value for genealogy and local history. Below the surface, waiting to be mined, is location data for social science and economics, and for enriching archives.

Evelyn is investigating ways to link the directories with other resources, or to find them by location search, to help make them more re-usable by more people.

How can the Unlock services help realise the potential in the Historical Directories? And will Linked Data help? There are two strands here – looking at the directories as data collections, and looking at the data implicit in the collections.

Let’s get a bit technical, over the fold.

Geo-references for the directories

Right now, each directory is annotated with placenames – the names of one or more counties containing places in the directory. Headings or sub-sections in the document may also contain placenames. (See the sample record for a directory covering Bedfordshire.)

As well as a name, the directories could have a link identifying a place. For example, the geonames Linked Data URL for Bedfordshire. The link can be followed to get approximate coordinates for use on a map display. This provides an easy way to connect with other resources that use the same link.
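As a sketch of what following such a link could involve: geonames publishes an RDF document for each entity (the …/about.rdf pattern), with coordinates in the W3C wgs84_pos vocabulary. The ID used below is geonames’ entity for modern London, as a stand-in only.

```python
# Sketch of dereferencing a geonames Linked Data URI to pick up coordinates.
# The ID used is geonames' entity for modern London, as a stand-in only;
# the about.rdf pattern and wgs84_pos vocabulary are the ones geonames publishes.
from rdflib import Graph, Namespace

WGS84 = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

def coordinates(geonames_id):
    g = Graph()
    g.parse("http://sws.geonames.org/%s/about.rdf" % geonames_id)
    lat = next(g.objects(predicate=WGS84.lat))
    lon = next(g.objects(predicate=WGS84.long))
    return float(lat), float(lon)

print(coordinates("2643743"))  # stand-in ID (modern London)
```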

The directory records would also benefit from simpler, re-usable links. Right now they have quite complex-looking URLs that look like lookup.asp?[lots of parameters]. To encourage re-use, it’s worth composing links that look cleaner, more like /directory/1951/kellys_trade/ This could also help with search engine indexing, making the directories more findable via Google. There are some Cabinet Office guidelines on URIs for the Public Sector that could be useful here.

Linked Data for the directories

Consider making each ‘fact file’ of metadata for a given directory available in a machine-readable form, using common Dublin Core elements where possible. This could be done embedded in the page, using a standard like RDFa, or at a separate URL, with an XML document describing and linking to the record.
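For the separate-document option, a fact file might look something like the sketch below (again using rdflib). The record URI echoes the cleaner URL style suggested above, and the title, date and geonames ID are illustrative placeholders rather than real catalogue values.

```python
# Sketch of a machine-readable 'fact file' for one directory, using Dublin Core
# terms. The record URI, title, date and geonames ID are illustrative placeholders.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS

g = Graph()
record = URIRef("http://www.historicaldirectories.org/directory/1903/kellys_beds/")  # hypothetical clean URI
g.add((record, DCTERMS.title, Literal("Kelly's Directory of Bedfordshire (1903)")))  # placeholder title
g.add((record, DCTERMS.temporal, Literal("1903")))                                   # placeholder date
g.add((record, DCTERMS.spatial, URIRef("http://sws.geonames.org/XXXXXXX/")))         # placeholder geonames ID

print(g.serialize(format="xml"))
```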

Consider a service like RCAHMS’ Scotland’s Places which, when you visit a location page, brings together related items from the catalogues of several different public records bodies in Scotland. Behind the scenes, different archives are being “cross-searched” via a web API, with records available in XML.

Mining the directories

The publications on the Historical Directories site are in PDF format. There have been OCR scans done but these aren’t published on the site – they are used internally for full-text search. (Though note the transcripts along with the scans are available for download from the UK Data Archive). The fulltext search on the Historical Directories site works really well, with highlights for found words in the PDF results.

But the gold in a text-mining effort like this is found in locations of the individual records themselves – the listings connected to street addresses and buildings. This kind of material is perfect for rapid demographic analysis. The Visualising Urban Geographies project between the National Library of Scotland and University of Edinburgh is moving in this direction – automatically geo-coding addresses to “good enough” accuracy. Stuart Nicol has made some great teaching tools using search engine geocoders embedded in a Google Spreadsheet.

But this demands a big transition – from “raw” digitised text to structured tabular data. As Rich Gibson would say about Planet Earth – “It’s not even regularly irregular” – and the transition can’t currently be successfully automated.

Meanwhile, some of the directories do have more narrative, descriptive text, interleaved with tabular data on population, trade, livestock. This material reminds me of the Statistical Accounts of Scotland.

For this kind of data there may be useful yield from the Unlock Text geoparsing service – extracting placenames and providing gazetteer links for the directory. Places mentioned in directories will necessarily be clustered together, so the geoparser’s techniques for ranking suggested locations and picking the most likely one should work well.

This is skimming the surface of what could be done with historic directories, and I would really like to hear about other related efforts.