Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910096 lines long.
Many placenames have the name of a containing county, though some don’t. The data is full of errors, mistakes in the original records, mis-heard names, maybe errors in transcription.
This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?
I made a start at this over the weekend, because I also wanted an excuse to play with the redis nosql data store.
To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places, rather than search for exact locations at this stage, we look for known-to-exist-as-place names. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.
unique placenames: 667513
known by geonames: 34180
unknown by geonames: 633333
We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetmap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.
Next, each of the unlocated placenames is compared to the grounded group of places, and if one name is very similar to another (as measured by Levenshtein distance with a handy python module) then a reference is stored that one place is the sameAs another.
Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. Hard to say at this stage how many will be in some kind of error (Easton matching Aston), 1 in 20 or hopefully many fewer.
place:sameas:WELBOURN : place:WELBURN
place:sameas:WELBOURY : place:WELBURY
place:sameas:ALSHORNE : place:ASHORNE
place:sameas:PHURLIGH : place:PURLEIGH
place:sameas:LANGATHN : place:LLANGATHEN
place:sameas:WIGISTON : place:WIGSTON
place:sameas:ALSHORPE : place:ASHOPE
place:sameas:PELSCHAM : place:ELSHAM
As I next stage, I plan to run the similarity test again, on the placenames derived from it in the first stage, with a higher threshold for similarity.
This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetmap’s Nominatim geocoding search service. I should probably write to them and mention this.
There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim) others are sub-regions or suburbs attached to other placenames, north/south/east/west prefixes.
Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?
A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.
We can also look at sound similarity algorithms like soundex and metaphone. There are concerns that this would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).
Records with unknown placenames can be roughly located near the places of related records.
How close is close enough for search? If the record is floating near the street, or the neighbourhood, that it belongs in, is that close enough?
And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?