Thoughts on Unlocking Historical Directories

January 26, 2010

Last week I talked with Evelyn Cornell of the Historical Directories project at the University of Leicester. The directories are mostly local, trade-focused listings that pre-date telephone directories. Early ones were commercial ventures; later ones were often produced with the involvement of public records offices and postal services. The ones digitised at the library in Leicester cover England and Wales from 1750 to 1919.

This is a rich resource for historical social analysis, with lots of detail about locations and what happened in them. On the surface, the directories have a lot of research value for genealogy and local history. Below the surface, waiting to be mined, is location data for social science, for economics, and for enriching archives.

Evelyn is investigating ways to link the directories with other resources, or to find them by location search, to help make them more re-usable for more people.

How can the Unlock services help realise the potential in the Historical Directories? And will Linked Data help? There are two strands here – looking at the directories as data collections, and looking at the data implicit in the collections.

Let’s get a bit technical, over the fold.

Geo-references for the directories

Right now, each directory is annotated with placenames – the names of one or more counties containing places in the directory. Headings or sub-sections in the document may also contain placenames. See, for example, the sample record for a directory covering Bedfordshire.

As well as a name, the directories could have a link identifying a place. For example, the geonames Linked Data URL for Bedfordshire. The link can be followed to get approximate coordinates for use on a map display. This provides an easy way to connect with other resources that use the same link.
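To make the idea concrete, here is a minimal sketch of composing and dereferencing such a link. The feature ID and the JSON record below are invented for illustration – real geonames responses are RDF/XML (or JSON from a different endpoint) with a richer shape.

```python
import json

def geonames_uri(feature_id):
    """Compose the Linked Data URI for a geonames feature."""
    return "http://sws.geonames.org/%d/" % feature_id

# A hand-written stand-in for what a record behind such a URI might contain.
sample_record = json.loads("""
{
  "name": "Bedfordshire",
  "lat": "52.08",
  "lng": "-0.49"
}
""")

def coordinates(record):
    """Pull approximate centroid coordinates for a map display."""
    return float(record["lat"]), float(record["lng"])

print(geonames_uri(2656046))       # hypothetical feature ID
print(coordinates(sample_record))  # approximate centroid for the county
```

In real use a client would fetch the URI with content negotiation and parse the RDF; the same URI then doubles as a shared identifier that other resources can link to.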

The directory records would also benefit from simpler, re-usable links. Right now they have quite complex URLs of the form lookup.asp?[lots of parameters]. To encourage re-use, it’s worth composing links that look cleaner, more like /directory/1951/kellys_trade/. This could also help with search engine indexing, making the directories more findable via Google. There are some Cabinet Office guidelines on URIs for the Public Sector that could be useful here.
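The mapping from parameterised URLs to clean paths can be sketched in a few lines. The parameter names and path scheme here are invented for illustration, not the site's actual ones:

```python
from urllib.parse import urlparse, parse_qs

def clean_path(lookup_url):
    """Turn a lookup.asp?year=...&title=... URL into /directory/<year>/<title>/"""
    params = parse_qs(urlparse(lookup_url).query)
    year = params["year"][0]
    title = params["title"][0].lower().replace(" ", "_")
    return "/directory/%s/%s/" % (year, title)

# parse_qs decodes '+' as a space, so the title becomes kellys_trade
print(clean_path("http://example.org/lookup.asp?year=1851&title=Kellys+Trade"))
```

In practice this would sit behind a server rewrite rule, with the old parameterised URLs redirecting to the clean ones so existing bookmarks keep working.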

Linked Data for the directories

Consider making each ‘fact file’ of metadata for a given directory available in a machine-readable form, using common Dublin Core elements where possible. This could be done embedded in the page, using a standard like RDFa, or it could be done at a separate URL, with an XML document describing and linking to the record.
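A minimal sketch of the separate-URL approach, serialising a fact file with Dublin Core elements as XML – the field values and identifier below are invented for illustration:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def fact_file(title, coverage, date, identifier):
    """Build a small Dublin Core record for one directory."""
    record = ET.Element("record")
    for tag, value in [("title", title), ("coverage", coverage),
                       ("date", date), ("identifier", identifier)]:
        elem = ET.SubElement(record, "{%s}%s" % (DC, tag))
        elem.text = value
    return ET.tostring(record, encoding="unicode")

xml = fact_file("Kellys Directory of Bedfordshire",
                "Bedfordshire", "1894",
                "http://example.org/directory/1894/kellys_beds/")
print(xml)
```

The same elements could equally be embedded in the existing HTML pages as RDFa attributes, which saves maintaining a second document per record.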

Consider a service like RCAHMS’ Scotland’s Places, which brings together related items from the catalogues of several different public records bodies in Scotland when you visit a location page. Behind the scenes, different archives are “cross-searched” via a web API, with records available in XML.

Mining the directories

The publications on the Historical Directories site are in PDF format. OCR scans have been done, but these aren’t published on the site – they are used internally for full-text search. (Though note that the transcripts, along with the scans, are available for download from the UK Data Archive.) The full-text search on the Historical Directories site works really well, with highlights for found words in the PDF results.

But the gold in a text-mining effort like this is found in the locations of the individual records themselves – the listings connected to street addresses and buildings. This kind of material is perfect for rapid demographic analysis. The Visualising Urban Geographies project between the National Library of Scotland and University of Edinburgh is moving in this direction – automatically geo-coding addresses to “good enough” accuracy. Stuart Nicol has made some great teaching tools using search engine geocoders embedded in a Google Spreadsheet.
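“Good enough” geocoding usually means falling back through levels of precision: match the full street address if you can, otherwise settle for the centroid of the town or parish. A sketch with an invented toy gazetteer:

```python
# Toy lookup tables -- coordinates and names invented for illustration.
street_gazetteer = {
    ("high street", "bedford"): (52.1359, -0.4667),
}
town_gazetteer = {
    "bedford": (52.135, -0.47),
}

def geocode(street, town):
    """Try the street first; fall back to the town centroid."""
    key = (street.lower(), town.lower())
    if key in street_gazetteer:
        return street_gazetteer[key], "street"
    if town.lower() in town_gazetteer:
        return town_gazetteer[town.lower()], "town centroid"
    return None, "unmatched"

print(geocode("High Street", "Bedford"))
print(geocode("Mill Lane", "Bedford"))   # unknown street: town centroid
```

For demographic analysis over a whole directory, the centroid fallback is often accurate enough, and the returned precision label lets you filter out the coarser matches when it isn't.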

But this demands a big transition – from “raw” digitised text to structured tabular data. As Rich Gibson would say about Planet Earth, the material is “not even regularly irregular”, and the transition can’t currently be successfully automated.

Meanwhile, many of the directories do have more narrative, descriptive text, interleaved with tabular data on population, trade and livestock. This material reminds me of the Statistical Accounts of Scotland.

For this kind of data there may be useful yield from the Unlock Text geoparsing service – extracting placenames and providing gazetteer links for the directory. Places mentioned in directories will naturally be clustered together, so the geoparser’s techniques for ranking suggested locations and picking the most likely one should work well.
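The clustering intuition can be sketched simply: when a placename has several candidate locations, prefer the candidate closest to the other places mentioned in the same document. This is not the geoparser's actual algorithm, just an illustration of the idea, with invented candidate coordinates:

```python
import math

def distance(a, b):
    # Crude planar distance in degrees -- fine for ranking nearby candidates.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pick_candidate(candidates, context_points):
    """Choose the candidate minimising total distance to context places."""
    return min(candidates,
               key=lambda c: sum(distance(c, p) for p in context_points))

# 'Newport' is ambiguous; the other places in this document are in south Wales.
candidates = [(51.588, -2.997),   # Newport, Gwent
              (52.077, 1.144)]    # a Newport in eastern England (illustrative)
context = [(51.48, -3.18), (51.62, -3.94)]  # Cardiff, Swansea (approximate)
print(pick_candidate(candidates, context))  # the Welsh Newport wins
```

Because a directory's places are densely clustered, even this crude total-distance ranking disambiguates most names; the hard cases are the ones with no nearby context at all.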

This is skimming the surface of what could be done with historic directories, and I would really like to hear about other related efforts.

Unlock places API — version 2.1

January 22, 2010

The Unlock places API was upgraded this week, with new functionality available from Tuesday, 19th January 2010. An upgrade to the Postgres/PostGIS database has enabled new ways of retrieving feature data from the gazetteer, so please visit the example queries page to try them out.
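As a taster, gazetteer queries against a service like this are just composed URLs. The endpoint path and parameter names below are illustrative stand-ins, not the documented interface – see the example queries page for the real ones:

```python
from urllib.parse import urlencode

def name_search(base, name, fmt="json"):
    """Compose an illustrative name-search query against a gazetteer API."""
    return "%s/nameSearch?%s" % (base, urlencode({"name": name, "format": fmt}))

url = name_search("http://unlock.edina.ac.uk/ws", "Leicester")
print(url)
```

A client would then fetch the URL and parse the JSON (or XML) feature list that comes back.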

We welcome any feedback on the new features – and if there’s anything you’d like to see in future versions of Unlock, please let us know. Alternatively, why not just get in touch to let us know how you’re using the service? We’d love to hear from you!

Full details of the changes are listed below the fold.


Places you won’t find in any dictionary

January 12, 2010

Tobar an Dualchais is an amazing archive of Gaelic and Scots speech and song samples. Under the hood, each of their records is annotated with places – the names of the village, or island, or parish, where the speaker came from.

We’ve been trying to Unlock their placename data, so the names can be given map coordinates, and the recordings searched by location. Also, I wanted to see how much difference it would make if the Ordnance Survey 50K gazetteer were open licensed, thus enabling us to use it for this (non-research) project.

Out of 1628 placenames, we found 851 exact matches in the 50K gazetteer and 1031 in geonames. Just 90 placenames were in the 50K but not in geonames. There’s a group of 296 placenames that we couldn’t find in any of our gazetteer data sources. Note that this is an unusual sample, focused on remote and infrequently surveyed places in the Highlands and Islands, but I had hoped for more from the 50K coverage.
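The matching arithmetic is just set operations. A sketch with a toy handful of names (the names below are invented; the real run covered 1628 placenames):

```python
# Invented toy data standing in for the real name lists.
names = {"Stornoway", "Clanyard", "Terregles", "Acharacle", "Unmapped Croft"}
os_50k = {"Stornoway", "Terregles", "Acharacle"}
geonames = {"Stornoway", "Terregles"}

in_50k = names & os_50k              # exact matches in the 50K gazetteer
in_geonames = names & geonames       # exact matches in geonames
only_50k = in_50k - in_geonames      # in the 50K but not geonames
in_neither = names - os_50k - geonames  # found in no gazetteer source

print(len(in_50k), len(in_geonames), len(only_50k), len(in_neither))
```

Scaled up to the full list, the same four counts give the 851 / 1031 / 90 / 296 breakdown reported above.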

There are quite a few fun reasons why there are so many placenames that you won’t find in any dictionary:

  • Historic places don’t appear in our contemporary OS sources. Many administrative areas in Scotland changed in 1974, and current OS data does not have the old names or boundaries. Geonames has some locations for historic places (e.g. approximate centroids for the old counties), though without time ranges.
  • Typographical errors in data entry, e.g. “Stornooway” and “Stornaway” – using the gazetteer web service at the content-creation stage would help with this.
  • Listings for places that are too small to be in a mid-scale gazetteer. For example, TAD data includes placenames for buildings belonging to clubs and societies where Gaelic sound recordings were made. Likely enough, some small settlements have escaped the notice of surveyors for OS and contributors to geonames.
  • Some places exist socially but not administratively. For example, our MasterMap gazetteer has records for a “Clanyard Bay”, “Clanyard House”, “Clanyard Mill” but not Clanyard itself. The Gazetteer for Scotland describes Clanyard as “a locality, made up of settlements” – High, Low and Middle Clanyards.
  • Geonames has local variant spellings as alternative names, and these show up in our gazetteer search, returning the more “authoritative” name.
  • Limitations in automated search for descriptions of names. For example, some placename fields look like “Terregles (DFS) see also Kirkcudbrightshire”. I’m hoping the new work on full-text search will help to address this – but there will always need to be a human confirmation stage, and fixes to the original records.
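Catching the typographical cases at content-creation time could be as simple as fuzzy matching against the gazetteer. Here difflib is a crude stand-in for a proper gazetteer suggestion service, with an invented toy name list:

```python
import difflib

# Toy gazetteer -- a real service would search the full name list.
gazetteer = ["Stornoway", "Stranraer", "Kirkcudbright", "Terregles"]

def suggest(name, cutoff=0.8):
    """Return the closest gazetteer name above the similarity cutoff."""
    matches = difflib.get_close_matches(name, gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest("Stornooway"))  # both misspellings resolve to Stornoway
print(suggest("Stornaway"))
```

A contributor typing “Stornooway” would then be prompted with the canonical spelling before the record is saved, rather than the error being discovered years later in a bulk match.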

It’s been invaluable to have a big set of known-to-be-placenames contributed in free-text fields by people who aren’t geographers. I would like to do more of this.

I saw a beautiful transcript of an Ordnance Survey Object Name Book on a visit to RCAHMS. Apparently many of the English and Welsh ones were destroyed in the war, but the Scottish ones survived. But that is a story for another time.

Linked Data, JISC and Access

January 8, 2010

With 2010 hindsight, I can smile at statements like:

“The Semantic Web can provide an underlying framework to allow the deployment of service architecture to support virtual organisations. This concept is now sometimes given the description the Semantic Grid.”

But that’s how it looked in the 2005 JISC report on “semantic web technologies”, which Paul Miller reviews at the start of his draft report on Linked Data Horizons.

I appreciate the new focus on fundamental raw data, the “core set of widely used identifiers” which connect topic areas and enable more of JISC’s existing investments to be linked up and re-used. JACS codes for undergraduate courses, or ISSNs for academic journals – simple things that can be made quickly and cheaply available in RDF, for open re-use.

It was a while after I read Paul’s draft before I clocked what was missing – a consideration of how Access Management schemes will affect the use of Linked Data in academic publishing.

Many JISC services require a user to prove their academic credentials; so do commercial publishers, public sector archives – the list is long, and growing.

URLs may have user or session identifiers in them, and accessing a URL may involve a web-browser-dependent Shibboleth login process that touches on multiple sites.

Publishers support the UK Federation, and sell subscriptions to institutions. On their public sites, one can see summaries, abstracts and thumbnails; but to get data, one has to be attached to an institution that pays a subscription and is part of the Federation.

Sites can publish Linked Data in RDF about their data resources. But if publishers want their data to be linked and indexed, they have to make two URLs for each bit of content: one public, one protected. Some data services are obliged to stay entirely Shibboleth-protected for licensing reasons, because the data available there is derived from other work that is licensed for academic use only.

EDINA’s ShareGeo service has this problem – its RSS feed of new data sets published by users is public, but to look at the items in it, one has to log in to Digimap through the UK Federation.

Unfortunately this breaks one of the four Linked Data principles – “When someone looks up a URI, provide useful information, using the standards”.

Outwith the access barrier, non-commercial terms of use for scholarly resources don’t complement a Linked Data approach well. For example, OCLC’s WorldCat bibliography search forbids “automated information-gathering devices”, which would catch a crawler/indexer looking for RDF. As Paul tactfully puts it:

To permit effective and widespread reuse, data must be explicitly licensed in ways that encourage third party engagement.