Linking Placename Authorities

April 9, 2010

Putting together a proposal for JISC call 02/10 based on a suggestion from Paul Ell at CDDA in Belfast. Why post it here? I think there’s value in working on these things in a more public way, and I’d like to know who else would find the work useful.


Generating a gazetteer of historic UK placenames, linked to documents and authority files in Linked Data form. Both working with existing placename authority files, and generating new authority files by extracting geographic names from text documents. Using the Edinburgh Geoparser to “georesolve” placenames and link them to widely-used geographic entities on the Linked Data web.


GeoDigRef was a JISC project to extract references to people and places from several very large digitised collections, to make them easier to search. The Edinburgh Geoparser was adapted to extract place references from large collections.

One roadblock in this and other projects has been the lack of an open historic placename gazetteer for the UK.

Placenames in authority files, and placenames text-mined from documents, can be turned into geographic links that connect items in collections with each other and with the Linked Data web; a historic gazetteer for the UK can be built as a byproduct.


Firstly, working with placename authority files from existing collections, starting with the digitised volumes of the English Place Name Survey as a basis.

Where placenames are found, they can be linked to the corresponding Linked Data entity in GeoNames, the motherlode of placename links on the Linked Data web, using the georesolver component of the Edinburgh Geoparser.

Secondly, using the geoparser to extract placename references from documents and using those placenames to seed an authority file, which can then be resolved in the same way.

An open source web-based tool will help users link places to one another, remove false positives found by the geoparser, and publish the results as RDF using an open data license.
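The "publish the results as RDF" step could look something like this in miniature. This is a hedged sketch, not the project's actual tool: the example.org authority URI and the GeoNames id are made-up placeholders, though the `owl:sameAs` predicate and the `sws.geonames.org` URI pattern are the conventional ones for this kind of link.

```python
# Minimal sketch: emit an owl:sameAs link from a placename authority
# record to a resolved GeoNames entity, as a Turtle triple.
# All URIs below are illustrative placeholders.

def make_sameas_triple(authority_uri, geonames_id):
    """Return a Turtle fragment linking an authority record to GeoNames."""
    geonames_uri = "http://sws.geonames.org/%d/" % geonames_id
    return ("<%s> <http://www.w3.org/2002/07/owl#sameAs> <%s> ."
            % (authority_uri, geonames_uri))

triple = make_sameas_triple(
    "http://example.org/placenames/ashby-de-la-zouch", 2656783)
print(triple)
```

A real implementation would more likely build a graph with a library such as rdflib and serialise the whole authority file at once, but the shape of the output is the same.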

Historic names will be imported back into the Unlock place search service.


This will leave behind a toolset for others to use, as well as creating new reference data.

Building on work done at the Open Knowledge Foundation to convert MARC/MADS bibliographic resources to RDF and add geographic links.

Re-using existing digitised resources from CDDA helps make them discoverable and provides a path in for researchers. GeoNames has some historic coverage, but it is hit and miss (e.g. “London” has “Londinium” as an alternate name, but at the contemporary location). The new OS OpenData sources are all contemporary.

A placename found in a text may not appear in any gazetteer. The more places correctly located, the more likely it is that other places mentioned in the same document will also be correctly located. More historic coverage means better georeferencing for more archival collections.
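The disambiguation idea above can be sketched as a proximity score: an ambiguous name's candidate locations are ranked by closeness to places already resolved in the same document. The coordinates and the scoring rule here are illustrative, not the Edinburgh Geoparser's actual algorithm.

```python
# Sketch: score candidate locations for an ambiguous placename by
# mean distance to places already resolved in the same document.
import math

def score(candidate, resolved):
    """Lower mean distance to already-resolved places => higher score."""
    if not resolved:
        return 0.0
    dists = [math.hypot(candidate[0] - r[0], candidate[1] - r[1])
             for r in resolved]
    return -sum(dists) / len(dists)

# Two candidate gazetteer entries for "Newport", scored against places
# already located in south Wales (approximate lat/lon pairs).
resolved = [(51.48, -3.18), (51.62, -3.94)]   # Cardiff, Swansea
candidates = {"Newport, Wales": (51.59, -3.00),
              "Newport, Isle of Wight": (50.70, -1.29)}
best = max(candidates, key=lambda k: score(candidates[k], resolved))
print(best)  # → Newport, Wales
```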


Dev8D: JISC Developer Days

March 5, 2010

The Unlock development team recently attended the Dev8D: JISC Developer Days conference at University College London. The format of the event was fairly loose, with multiple sessions running in parallel and the programme created dynamically as the four days progressed. Delegates were encouraged to use their feet to seek out whatever interested them! The idea is simple: developers, mainly (but not exclusively) from academic organisations, come together to share ideas, work together and strengthen professional and social connections.

A series of back-to-back 15-minute ‘lightning talks’ ran throughout the conference; I delivered two, describing EDINA’s Unlock services and showing users how to get started with the Unlock Places APIs. Discussion after the talks focused on open sourcing and licensing of the Unlock Places software generally, and on which open gazetteer data sources we plan to include in future.
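Getting started with a name-search style gazetteer API like Unlock Places is mostly a matter of building a query URL and parsing the response. The base URL and parameter names below are assumptions from memory rather than a definitive reference, so check the current API documentation before relying on them.

```python
# Hedged sketch of constructing an Unlock Places name-search request.
# Endpoint path and parameter names are assumptions, not gospel.
from urllib.parse import urlencode

def name_search_url(name, fmt="json",
                    base="http://unlock.edina.ac.uk/ws/nameSearch"):
    """Build a gazetteer name-search URL for the given placename."""
    return base + "?" + urlencode({"name": name, "format": fmt})

url = name_search_url("Edinburgh")
print(url)
# Fetching the URL would return feature records to parse with json.loads().
```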

In parallel with the lightning talks, workshop sessions were held on a variety of topics such as linked data, iPhone application development, working with Arduino and Google App Engine.

Throughout Dev8D, several competitions or ‘bounties’ were held around different themes. In our competition, delegates had the chance to win a £200 Amazon voucher by entering a prototype application making use of the Unlock Places API, with the prize going to the most innovative and useful application.

I gave a quick announcement at the start of the week to introduce the competition and explain how to get started with the API, and then demonstrated a mobile client for the Unlock Places gazetteer as an example of the sort of entry we were looking for. This application makes use of the new HTML5 web database functionality, enabling users to download and store Unlock’s feature data offline on a mobile device. Here are some of the entries:

Marcus Ramsden from Southampton University created a plugin for EPrints, the open access repository software. Using the Unlock Text geoparser, ‘GeoPrints’ extracts locations from documents uploaded to EPrints, then provides a mechanism to browse EPrints documents using maps.

Aidan Slingsby from City University entered some beautiful work displaying point data (in this case a gazetteer of British placenames) as tag maps, density-estimation surfaces and chi surfaces rather than the usual map pins! The data was based on GeoNames data accessed through the Unlock Places API.

And the winner was… Duncan Davidson from Informatics Ventures at the University of Edinburgh, who used the Unlock Places APIs together with Yahoo Pipes to present data on new start-ups and projects around Scotland. By converting the local council names in the data into footprints, Unlock Places allowed the data to be mapped using KML and Google Maps, so his users could navigate around the data on a map and search it using spatial constraints.
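The footprint-to-map step in that pipeline can be sketched as rendering a gazetteer footprint (simplified here to a bounding box) as a KML Placemark for display in Google Maps. The function and field names are illustrative, not from the winning entry itself.

```python
# Sketch: render a bounding-box footprint as a KML Placemark polygon.
# KML coordinates are "lon,lat" and the ring must close on itself.

def bbox_to_kml_placemark(name, west, south, east, north):
    """Wrap a bounding box in a KML Placemark for map display."""
    ring = [(west, south), (east, south), (east, north),
            (west, north), (west, south)]  # closed ring
    coords = " ".join("%f,%f" % (lon, lat) for lon, lat in ring)
    return ("<Placemark><name>%s</name><Polygon><outerBoundaryIs>"
            "<LinearRing><coordinates>%s</coordinates></LinearRing>"
            "</outerBoundaryIs></Polygon></Placemark>"
            % (name, coords))

kml = bbox_to_kml_placemark("City of Edinburgh", -3.45, 55.82, -3.08, 56.00)
```

A real council footprint would be an arbitrary polygon rather than a box, but the KML wrapping is the same.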

Some other interesting items at Dev8D…

  • <sameAs>
    Hugh Glaser from the University of Southampton discussed how the sameAs service works to establish linkage between datasets by managing multiple URIs for Linked Data without a central authority. Hugh demonstrated using the service to locate co-references between different data sets.
  • Mendeley
    is a research network. Jan Reichelt and Ben Dowling discussed how, by tracking, sharing and organising journal/article history, Mendeley is designed to help users discover and keep in touch with like-minded researchers. I heard of Mendeley last year and was surprised by its large (and rapidly increasing) user base; the collective data from its users is already proving a very powerful resource.
  • Processing
    Need to do rapid visualisation of images, animations or interactions? Processing is a Java-based sketchbook/IDE that helps you visualise your data much more quickly. Ross McFarlane from the University of Liverpool gave a quick tutorial on Processing.js, a JavaScript port using <Canvas>, illustrating the power and versatility of this library.
  • Genetic Programming
    This session centred on some basic aspects of genetic algorithms/evolutionary computing and the emergent properties of evolutionary systems. Delegates focused on creating virtual ants (in Python) to solve mazes and on visualising their creatures with Processing (above); Richard Jones enabled developers to work on something a bit different!
  • Web Security
    Ben Charlton from the University of Kent delivered an excellent walk-through of the most significant and most common threats to web applications. Working from the OWASP Top 10 project, he discussed each threat with real-world examples. Great stuff, and important for all developers to see.
  • Replicating 3D Printer: RepRap
    Adrian Bowyer demonstrated RepRap, short for Replicating Rapid-prototyper. It’s an open source (GPL) device able to create robust 3D plastic components, including around half of its own components. Its novel capability of self-copying, with material costs of only €350, makes it accessible to small communities in the developing world as well as individuals in the developed world. His inspiring talk was well received, and this super illustration of open information’s far-reaching implications captured everyone’s imagination.

All in all, a great conference. A broad spread of topics, with the right mix of sit-and-listen to get-involved activities. Whilst Dev8D is a fairly chaotic event, it’s clear that it generates a wealth of great ideas, contacts and even new products and services for academia. See Dev8D’s Happy Stories page for a record of some of the outcomes. I’m now looking forward to seeing how some of the prototypes evolve and I’m definitely looking forward to Dev8D 2011.

Linked Data, JISC and Access

January 8, 2010

With 2010 hindsight, I can smile at statements like:

“The Semantic Web can provide an underlying framework to allow the deployment of service architecture to support virtual organisations. This concept is now sometimes given the description the Semantic Grid.”

But that’s how it looked in the 2005 JISC report on “semantic web technologies”, which Paul Miller reviews at the start of his draft report on Linked Data Horizons.

I appreciate the new focus on fundamental raw data, the “core set of widely used identifiers” which connect topic areas and enable more of JISC’s existing investments to be linked up and re-used. JACS codes for undergraduate courses, or ISSNs for academic journals – simple things that can be made quickly and cheaply available in RDF, for open re-use.

It was a while after reading Paul’s draft that I clocked what is missing: a consideration of how Access Management schemes will affect the use of Linked Data in academic publishing.

Many JISC services require a user to prove their academic credentials; so do commercial publishers, public sector archives – the list is long, and growing.

URLs may have user/session identifiers in them, and to access a URL may involve a web-browser-dependent Shibboleth login process that touches on multiple sites.

Publishers support the UK Federation and sell subscriptions to institutions. On their public sites one can see summaries, abstracts and thumbnails, but to get at the data one has to be attached to an institution that pays a subscription and is part of the Federation.

Sites can publish Linked Data in RDF about their data resources. But if publishers want their data to be linked and indexed, they have to make two URLs for each bit of content: one public, one protected. Some data services are obliged to stay entirely Shibboleth-protected for licensing reasons, because the data available there is derived from other work that is licensed for academic use only.
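In outline, the two-URL pattern amounts to a public URI that always yields metadata, alongside a protected URI that requires federated login. A minimal sketch, with entirely hypothetical paths and dataset names, no real publisher's scheme:

```python
# Sketch of the "two URLs per item" pattern: public metadata URI plus
# a protected data URI. All paths and titles are hypothetical.

PUBLIC_METADATA = {
    "/id/dataset/42": {
        "title": "Example boundary dataset",
        "access": "/secure/dataset/42",   # Shibboleth-protected copy
    },
}

def dereference(path, authenticated=False):
    """Public URIs always yield metadata; the data needs authentication."""
    meta = PUBLIC_METADATA.get(path)
    if meta is None:
        return 404, None
    if authenticated:
        return 200, meta["access"]      # hand over the protected URL
    return 200, {"title": meta["title"]}  # useful public information
```

The point of the sketch is that a crawler with no credentials still gets something useful back from the public URI, which is what the Linked Data principle quoted below asks for.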

EDINA’s ShareGeo service has this problem – its RSS feed of new data sets published by users is public, but to look at the items in it, one has to log in to Digimap through the UK Federation.

Unfortunately this breaks one of the four Linked Data principles: “When someone looks up a URI, provide useful information, using the standards”.

Outwith the access barrier, non-commercial terms of use for scholarly resources don’t complement a Linked Data approach well. For example, OCLC’s WorldCat bibliography search forbids “automated information-gathering devices”, which would catch a crawler/indexer looking for RDF. As Paul tactfully puts it:

To permit effective and widespread reuse, data must be explicitly licensed in ways that encourage third party engagement.