Linking Placename Authorities

April 9, 2010


Putting together a proposal for JISC call 02/10 based on a suggestion from Paul Ell at CDDA in Belfast. Why post it here? I think there’s value in working on these things in a more public way, and I’d like to know who else would find the work useful.

Summary

Generating a gazetteer of historic UK placenames, linked to documents and authority files in Linked Data form. Both working with existing placename authority files and generating new authority files by extracting geographic names from text documents. Using the Edinburgh Geoparser to “georesolve” placenames and link them to widely-used geographic entities on the Linked Data web.

Background

GeoDigRef was a JISC project to extract references to people and places from several very large digitised collections, to make them easier to search. The Edinburgh Geoparser was adapted to extract place references from these large collections.

One roadblock in this and other projects has been the lack of an open historic placename gazetteer for the UK.

Placenames in authority files, and placenames text-mined from documents, can be turned into geographic links that connect items in collections with each other and with the Linked Data web; a historic gazetteer for the UK can be built as a byproduct.

Proposal

Firstly, working with placename authority files from existing collections, starting with the digitised volumes of the English Place Name Survey as a basis.

Where placenames are found, they can be linked to the corresponding Linked Data entity in geonames.org, the motherlode of placename links on the Linked Data web, using the georesolver component of the Edinburgh Geoparser.
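For illustration, here is a minimal sketch of the kind of lookup involved, using the public GeoNames search web service. The function and parameter choices below are mine, and this is not the georesolver itself – just the flavour of the resolution step:

    # A minimal sketch of resolving a placename against GeoNames, assuming
    # the public search API at api.geonames.org and a registered username.
    import requests

    def resolve_placename(name, username="demo"):
        """Return candidate GeoNames entities for a placename."""
        resp = requests.get("http://api.geonames.org/searchJSON",
                            params={"name_equals": name, "country": "GB",
                                    "maxRows": 5, "username": username})
        resp.raise_for_status()
        return [{"name": g["name"],
                 "lat": float(g["lat"]), "lng": float(g["lng"]),
                 # The Linked Data URI for the matched entity:
                 "uri": "http://sws.geonames.org/%s/" % g["geonameId"]}
                for g in resp.json().get("geonames", [])]

    print(resolve_placename("Bedfordshire"))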

Secondly, using the geoparser to extract placename references from documents and using those placenames to seed an authority file, which can then be resolved in the same way.

An open source web-based tool will help users link places to one another, remove false positives found by the geoparser, and publish the results as RDF using an open data license.
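As a sketch of what the published output might look like – the gazetteer URI and GeoNames ID below are made up, and owl:sameAs is just one plausible choice of linking predicate:

    # Publishing a confirmed place link as RDF with rdflib; the example
    # URIs are hypothetical.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import OWL, RDFS

    g = Graph()
    place = URIRef("http://example.org/gazetteer/bedfordshire")  # hypothetical
    g.add((place, RDFS.label, Literal("Bedfordshire")))
    g.add((place, OWL.sameAs, URIRef("http://sws.geonames.org/2656046/")))  # placeholder ID
    print(g.serialize(format="turtle"))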

Historic names will be imported back into the Unlock place search service.

Context

This will leave behind a toolset for others to use, as well as creating new reference data.

Building on work done at the Open Knowledge Foundation to convert MARC/MADS bibliographic resources to RDF and add geographic links.

Re-using existing digitised resources from CDDA to help make them discoverable, and to provide a path in for researchers.

Geonames.org has some historic coverage, but it is hit and miss (e.g. “London” has “Londinium” as an alternate name, but at the contemporary location). The new OS OpenData sources are all contemporary.

A placename found in a text may not be found in any gazetteer. The more places correctly located, the higher the likelihood that other places mentioned in a document will also be correctly located. More historic coverage means better georeferencing for more archival collections.


Work in progress with OS Open Data

April 2, 2010

The April 1st release of many Ordnance Survey datasets as open data is great news for us at Unlock. As hoped for, Boundary-Line (administrative boundaries), the 50K gazetteer of placenames and a modified version of Code-Point (postal locations) are now open data.

[Image: Boundary-Line of Edinburgh shown in Google Earth. Contains Ordnance Survey data © Crown copyright and database right 2010]

We’ll be putting these datasets into the open access part of Unlock Places, our place search service, and opening up Unlock Geocodes based on Code-Point Open. However, this is going to take a week or two, because we’re also adding some new features to Unlock’s search and results.

Currently, registered academic users are able to:

  • Grab shapes and bounding boxes in KML or GeoJSON – no need for GIS software, re-use in web applications
  • Search by bounding box and feature type as well as place name (sketched after this list)
  • See properties of shapes (area, perimeter, central point) useful for statistics visualisation
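As a rough illustration of a bounding-box search of this kind – the endpoint and parameter names below are my shorthand for illustration, not the documented API, so check the Unlock Places documentation for the real thing:

    # A sketch of a bounding-box feature search against Unlock Places.
    # The endpoint and parameter names are assumptions, for illustration only.
    import requests

    resp = requests.get("http://unlock.edina.ac.uk/ws/featureSearch",  # hypothetical
                        params={"name": "Edinburgh",
                                "minx": -3.5, "miny": 55.8,  # bounding box, WGS84
                                "maxx": -3.0, "maxy": 56.0,
                                "format": "json"})           # or KML
    resp.raise_for_status()
    for feature in resp.json().get("features", []):
        print(feature)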

And soon we’ll be publishing these new features, currently in testing:

  • Relationships between places – cities, counties and regions containing found places – in the default results
  • Re-project points and shapes into different coordinate reference systems

These have been added so we can finally plug the Unlock Places search into EDINA’s Digimap service.

Having Boundary-Line shapes in our open data gazetteer will mean we can return bounding boxes or polygons through Unlock Text, which extracts placenames from documents and metadata. This will help to open up new research directions for our work with the Language Technology Group at Informatics in Edinburgh.

There are some organisations we’d love to collaborate with (almost next door, the Map Library at the National Library of Scotland and the Royal Commission on Ancient and Historical Monuments of Scotland) but have been unable to, because Unlock and its predecessor GeoCrossWalk were limited by license to academic use only. I look forward to seeing all the things the OS Open Data release has now made possible.

I’m also excited to see what re-use we and others could make of the Linked Data published by Ordnance Survey Research, and what their approach will be to connecting shapes to their administrative model.

MasterMap, the highest-detail OS dataset, wasn’t included in the open release. Academic subscribers to the Digimap Ordnance Survey Collection get access to places extracted from MasterMap, and improvements to other datasets created using MasterMap, with an Unlock Places API key.


OpenSearch Geospatial in progress

March 15, 2010

One promising presentation I saw last week at the Jornadas SIG Libre was Oscar Fonts’ work in the Geographic Information Group at the Universitat Jaume I, building OpenSearch Geospatial interfaces to different services.

[Image: OpenSearch geo query of OpenStreetMap]

The demonstrator shown during the talk was an OpenLayers map display hooked up to various OpenSearch Geo services.

Some are “native” OpenSearch services, like the GeoCommons data deposit and mapmaking service, and the interfaces published by Terradue as part of GENESI-DR, the European earth observation distributed data repository project.

The UJI demo also includes API adapters for sensationally popular web services with geographic content. Through the portal one can search for tweets, geotagged Flickr photos, or individual shapes from OpenStreetMap.

Oscar’s talk highlighted the seeming incompatibility between the original draft of the OpenSearch Geospatial extensions and the version making its way through the Open Geospatial Consortium’s Catalog working group as a “part document” of the next Catalog Services for the Web specification.

The issues currently breaking backwards-compatibility between the versions are these:

      1. geo:locationString became geo:name in the OGC draft version.
      2. geo:polygon was omitted from the OGC draft version, and replaced with geo:geometry, which allows complex geometries (including multi-polygons) to be passed through using Well Known Text.

1) looks like syntactic sugar – geo:name is less typing, and reads better. geo:locationString can be deprecated but still supported.

2) geo:geometry was introduced into the spec as a result of work on the GENESI-DR project, which had a strong requirement to support multi-polygons (specifically, satellite passes over the earth that crossed the dateline, and so were made up of two polygons meeting on either side of it).

geo:polygon has a much simpler syntax: just a list of (latitude, longitude) pairs which join up to make a shape. This also restricts queries to two dimensions.
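To make the contrast concrete, here is a sketch of the same shape expressed both ways, against a made-up search URL:

    # The same closed ring as geo:polygon (latitude,longitude pairs) and as
    # geo:geometry (Well Known Text, longitude-first). The URL is made up.
    from urllib.parse import urlencode

    ring = [(55.95, -3.2), (55.90, -3.1), (55.85, -3.3), (55.95, -3.2)]

    geo_polygon = ",".join("%s,%s" % (lat, lon) for lat, lon in ring)
    geo_geometry = "POLYGON((%s))" % ",".join("%s %s" % (lon, lat) for lat, lon in ring)

    print("http://example.com/search?" + urlencode({"polygon": geo_polygon}))
    print("http://example.com/search?" + urlencode({"geometry": geo_geometry}))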

This seems to be the nub of the discussion: should geo:polygon be included in the updated version, at the risk of being seen as clashing with, or superfluous to, geo:geometry, and so confusing end users?

There is always a balance to be struck between simplicity and complexity. Oscar pointed out in his talk what I have heard in OGC Catalog WG discussions too – that as soon as a use case becomes sufficiently complex, CSW is available and likely fitter for the job. geo:geometry is already at the top end of acceptable complexity.

It’s about a year since I helped turn Andrew Turner’s original draft into an OGC-consumable form. Anecdotally, it seems like a lot more people are interested in seeing what can be done with OpenSearch Geo now.

The OGC version is not a fork. The wiki draft was turned into a draft OGC spec after talking with Andrew and Raj Singh about the proposed changes, partly on the OpenSearch Google Group. The geo:relation parameter was added on the basis of feedback from the GeoNetwork and GeoTools communities. There’s been a Draft 2 page, as yet unmodified, on the OpenSearch wiki since that time.

In order to build the confidence of potential adopters, these backwards-incompatibilities do need to be addressed. My personal point of view would be to update the wiki draft, deprecating locationString and including both polygon and geometry parameters.

I was impressed by the work of Oscar and collaborators, though I wonder whether they will move into search-engine-style aggregation and indexing of the results, or just use the OpenSearch interface to search fairly fast-moving data sources in realtime. I wish I’d asked this question in the session, now. It all offers reinforcement and inspiration for putting OpenSearch Geo interfaces on services nearby – Go-Geo!, CKAN. The NERC Data Discovery Service could benefit, as could SCRAN. I’m glad we’ll get to see what happens.


Dev8D: JISC Developer Days

March 5, 2010

The Unlock development team recently attended the Dev8D: JISC Developer Days conference at University College London. The format of the event was fairly loose, with multiple sessions in parallel and the programme created dynamically as the four days progressed. Delegates were encouraged to use their feet to seek out whatever interested them! The idea is simple: developers, mainly (but not exclusively) from academic organisations, come together to share ideas, work together and strengthen professional and social connections.

A series of back-to-back 15-minute ‘lightning talks’ ran throughout the conference; I delivered two, describing EDINA’s Unlock services and showing users how to get started with the Unlock Places APIs. Discussions after the talks focused on the open sourcing and licensing of the Unlock Places software generally – and on what future open gazetteer data sources we plan to include.

In parallel with the lightning talks, workshop sessions were held on a variety of topics such as linked data, iPhone application development, working with Arduino, and the Google App Engine.

Competitions
Throughout Dev8D, several competitions or ‘bounties’ were held around different themes. In our competition, delegates had the chance to win a £200 Amazon voucher by entering a prototype application making use of the Unlock Places API – the most innovative and useful application would win!

I gave a quick announcement at the start of the week to introduce the competition and how to get started using the API, and then demonstrated a mobile client for the Unlock Places gazetteer as an example of the sort of entry we were looking for. This application makes use of the new HTML5 web database functionality, enabling users to download and store Unlock’s feature data offline on a mobile device. Here are some of the entries:

Marcus Ramsden from Southampton University created a plugin for EPrints, the open access repository software. Using the Unlock Text geoparser, ‘GeoPrints’ extracts locations from documents uploaded to EPrints, then provides a mechanism to browse EPrints documents using maps.

Aidan Slingsby from City University entered some beautiful work displaying point data (in this case a gazetteer of British placenames) as tag-maps, density estimation surfaces and chi surfaces rather than the usual map-pins! The data was based on GeoNames data accessed through the Unlock Places API.

And the winner was… Duncan Davidson from Informatics Ventures, University of Edinburgh. He used the Unlock Places APIs together with Yahoo Pipes to present data on new start-ups and projects around Scotland. By converting the local council names in the data into footprints, Unlock Places allowed the data to be mapped using KML and Google Maps, letting his users navigate around the data on maps and search it using spatial constraints.

Some other interesting items at Dev8D…

  • <sameAs>
    Hugh Glaser from the University of Southampton discussed how sameAs.org works to establish linkage between datasets by managing multiple URIs for Linked Data without an authority. Hugh demonstrated using sameAs.org to locate co-references between different data sets.
  • Mendeley
    Mendeley is a research network built around the same principle as last.fm. Jan Reichelt and Ben Dowling discussed how, by tracking, sharing and organising journal/article history, Mendeley is designed to help users discover and keep in touch with like-minded researchers. I heard of Mendeley last year and was surprised by the large (and rapidly increasing) user base – the collective data from its users is already proving a very powerful resource.
  • Processing
    Need to do rapid visualisation of images, animations or interactions? Processing is a Java-based sketchbook/IDE which will help you visualise your data much more quickly. Ross McFarlane from the University of Liverpool gave a quick tutorial on Processing.js, a JavaScript port using <Canvas>, illustrating the power and versatility of this library.
  • Genetic Programming
    This session centred on some basic aspects of Genetic Algorithms/Evolutionary Computing and the emergent properties of evolutionary systems. Delegates focused on creating virtual ants (in Python) to solve mazes and on visualising their creatures with Processing (above); Richard Jones enabled developers to work on something a bit different!
  • Web Security
    Ben Charlton from the University of Kent delivered an excellent walk-through of the most significant and very common threats to web applications. Working from the OWASP Top 10 project, he discussed each threat with real-world examples. Great stuff – important for all developers to see.
  • Replicating 3D Printer: RepRap
    Adrian Bowyer demonstrated RepRap – short for Replicating Rapid-prototyper. It’s an open source (GPL) device, able to create robust 3D plastic components (including around half of its own components). Its novel capability of self-copying, with material costs of only €350, makes it accessible to small communities in the developing world as well as individuals in the developed world. His inspiring talk was well received, and this super illustration of open information’s far-reaching implications captured everyone’s imagination.

All in all, a great conference. A broad spread of topics, with the right mix of sit-and-listen and get-involved activities. Whilst Dev8D is a fairly chaotic event, it’s clear that it generates a wealth of great ideas, contacts and even new products and services for academia. See Dev8D’s Happy Stories page for a record of some of the outcomes. I’m now looking forward to seeing how some of the prototypes evolve – and I’m definitely looking forward to Dev8D 2011.


Thoughts on Unlocking Historical Directories

January 26, 2010

Last week I talked with Evelyn Cornell, of the Historical Directories project at the University of Leicester. The directories are mostly local listings information, trade focused, that pre-date telephone directories. Early ones were commercial ventures; later ones were often produced with the involvement of public records offices and postal services. The ones digitised at the library in Leicester cover England and Wales from 1750 to 1919.

This is a rich resource for historic social analysis, with lots of detail about locations and what happened in them. On the surface, the directories have a lot of research value for genealogy and local history. Below the surface, waiting to be mined, is location data for social science, economics, enriching archives.

Evelyn is investigating ways to link the directories with other resources, or to find them by location search, to help make them more re-usable for more people.

How can the Unlock services help realise the potential in the Historical Directories? And will Linked Data help? There are two strands here – looking at the directories as data collections, and looking at the data implicit in the collections.

Let’s get a bit technical, over the fold.

Geo-references for the directories

Right now, each directory is annotated with placenames – the names of one or more counties containing places in the directory. Headings or sub-sections in the document may also contain placenames. See, for example, the sample record for a directory covering Bedfordshire.

As well as a name, the directories could have a link identifying a place. For example, the geonames Linked Data URL for Bedfordshire. The link can be followed to get approximate coordinates for use on a map display. This provides an easy way to connect with other resources that use the same link.
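Here is a sketch of that link-following step, using rdflib to read the RDF document behind a GeoNames URI. The geonameId below is a placeholder, not a checked identifier for Bedfordshire:

    # Following a GeoNames Linked Data URI to get approximate coordinates.
    from rdflib import Graph, Namespace, URIRef

    WGS84 = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
    uri = URIRef("http://sws.geonames.org/2656046/")      # placeholder ID

    g = Graph()
    g.parse("http://sws.geonames.org/2656046/about.rdf")  # the RDF behind the URI
    print(g.value(uri, WGS84.lat), g.value(uri, WGS84["long"]))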

The directory records would also benefit from simpler, re-usable links. Right now they have quite complex-looking URLs of the form lookup.asp?[lots of parameters]. To encourage re-use, it’s worth composing links that look cleaner, more like /directory/1951/kellys_trade/. This could also help with search engine indexing, making the directories more findable via Google. There are some Cabinet Office guidelines on URIs for the Public Sector that could be useful here.

Linked Data for the directories

Consider making each ‘fact file’ of metadata for a given directory available in machine-readable form, using common Dublin Core elements where possible. This could be embedded in the page, using a standard like RDFa, or published at a separate URL as an XML document describing and linking to the record.
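For instance, a minimal fact file in Dublin Core terms might be generated with rdflib like this – the titles, dates and URIs are invented:

    # Illustrative only: a machine-readable 'fact file' for one directory.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    record = URIRef("http://example.org/directory/1851/kellys_beds")  # hypothetical
    g = Graph()
    g.add((record, DCTERMS.title, Literal("Kelly's Directory of Bedfordshire, 1851")))
    g.add((record, DCTERMS.date, Literal("1851")))
    g.add((record, DCTERMS.spatial, URIRef("http://sws.geonames.org/2656046/")))  # place link
    print(g.serialize(format="xml"))  # RDF/XML, for serving at a separate URL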

Consider a service like RCAHMS’ Scotland’s Places, which brings together related items from the catalogues of several different public records bodies in Scotland when you visit a location page. Behind the scenes, the different archives are “cross-searched” via a web API, with records available in XML.

Mining the directories

The publications on the Historical Directories site are in PDF format. OCR scans have been done, but these aren’t published on the site – they are used internally for full-text search. (Though note the transcripts, along with the scans, are available for download from the UK Data Archive.) The full-text search on the Historical Directories site works really well, with highlights for found words in the PDF results.

But the gold in a text-mining effort like this is in the locations of the individual records themselves – the listings connected to street addresses and buildings. This kind of material is perfect for rapid demographic analysis. The Visualising Urban Geographies project between the National Library of Scotland and the University of Edinburgh is moving in this direction – automatically geo-coding addresses to “good enough” accuracy. Stuart Nicol has made some great teaching tools using search engine geocoders embedded in a Google Spreadsheet.

But this demands a big transition – from “raw” digitised text to structured tabular data. As Rich Gibson would say about Planet Earth, “It’s not even regularly irregular” – and the process can’t currently be successfully automated.

Meanwhile, some of the directories do have more narrative, descriptive text, interleaved with tabular data on population, trade and livestock. This material reminds me of the Statistical Accounts of Scotland.

For this kind of data there may be useful yield from the Unlock Text geoparsing service – extracting placenames and providing gazetteer links for the directory. Places mentioned in directories will necessarily be clustered together, so the geoparser’s techniques for ranking suggested locations and picking the most likely one should work well.
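To show the clustering idea in miniature – a toy heuristic, emphatically not the Edinburgh Geoparser’s actual algorithm:

    # Toy disambiguation: for each ambiguous name, prefer the candidate
    # location closest on average to the other names' candidate locations.
    import math

    def distance(a, b):
        # Rough planar distance in degrees; fine for a toy ranking.
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def pick_locations(candidates):
        """candidates maps placename -> list of (lat, lon) options."""
        chosen = {}
        for name, options in candidates.items():
            others = [pt for n, pts in candidates.items() if n != name for pt in pts]
            chosen[name] = min(options,
                               key=lambda pt: sum(distance(pt, o) for o in others))
        return chosen

    # Two 'Newport's: clustering picks the one near the other Welsh places.
    print(pick_locations({
        "Newport": [(51.58, -2.99), (50.70, -1.29)],  # Wales vs Isle of Wight
        "Cardiff": [(51.48, -3.18)],
        "Swansea": [(51.62, -3.94)],
    }))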

This is skimming the surface of what could be done with historic directories, and I would really like to hear about other related efforts.


Unlock Places API — version 2.1

January 22, 2010

The Unlock Places API was upgraded this week, with new functionality available from Tuesday, 19th January 2010. An upgrade to the Postgres/PostGIS database has enabled new ways of retrieving feature data from the gazetteer, so please visit the example queries page to try them out.

We welcome any feedback on the new features – and if there’s anything you’d like to see in future versions of Unlock, please let us know. Alternatively, why not just get in touch to tell us how you’re using the service? We’d love to hear from you!

Full details of the changes are listed below the fold.



Places you won’t find in any dictionary

January 12, 2010

Tobar an Dualchais is an amazing archive of Gaelic and Scots speech and song samples. Under the hood, each of their records is annotated with places – the names of the village, or island, or parish, where the speaker came from.

We’ve been trying to Unlock their placename data, so the names can be given map coordinates and the recordings searched by location. I also wanted to see how much difference it would make if the Ordnance Survey 50K gazetteer were openly licensed, enabling us to use it for this (non-research) project.

Out of 1628 placenames, we found 851 exact matches in the 50K gazetteer and 1031 in the geonames.org gazetteer. Just 90 placenames were in the 50K but not in geonames. There’s a group of 296 placenames that we couldn’t find in any of our gazetteer data sources. Note that this is an unusual sample, focused on remote and infrequently surveyed places in the Highlands and Islands, but I had hoped for more from the 50K coverage.
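The counting itself is straightforward once each gazetteer is reduced to a set of normalised name strings. A sketch, with invented file names, assuming one placename per line:

    # Coverage comparison between a contributed name list and two gazetteers.
    def load(path):
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    names = load("tobar_placenames.txt")   # invented file names
    os50k = load("os_50k_names.txt")
    geonames = load("geonames_names.txt")

    print("exact matches in 50K:", len(names & os50k))
    print("exact matches in geonames:", len(names & geonames))
    print("in 50K but not geonames:", len((names & os50k) - geonames))
    print("in neither:", len(names - os50k - geonames))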

There are quite a few fun reasons why there are so many placenames that you won’t find in any dictionary:

  • Places that are historic don’t appear in our contemporary OS sources. Many administrative areas in Scotland changed in 1974, and current OS data does not have the old names or boundaries. Geonames has some locations for historic places (e.g. approximate centroids for the old counties) though without time ranges.
  • Typographical errors in data entry. E.g. “Stornooway” and “Stornaway” – using the gazetteer web service at the content creation stage would help with this.
  • Listings for places that are too small to be in a mid-scale gazetteer. For example, the Tobar an Dualchais data includes placenames for buildings belonging to clubs and societies where Gaelic sound recordings were made. Likely enough, some small settlements have escaped the notice of surveyors for OS and contributors to geonames.
  • Some places exist socially but not administratively. For example, our MasterMap gazetteer has records for a “Clanyard Bay”, “Clanyard House”, “Clanyard Mill” but not Clanyard itself. The Gazetteer for Scotland describes Clanyard as “a locality, made up of settlements” – High, Low and Middle Clanyards.
  • Geonames has local variant spellings as alternative names, and these show up in our gazetteer search, returning the more “authoritative” name.
  • Limitations in automated search for descriptions of names. For example, some placenames look like “Terregles (DFS) see also Kirkcudbrightshire”. I’m hoping the new work on fulltext search will help to address this – but there will always need to be a human confirmation stage, and fixes to the original records.

It’s been invaluable to have a big set of known-to-be-placenames contributed in free-text fields by people who aren’t geographers. I would like to do more of this.

I saw a beautiful transcript of an Ordnance Survey Object Name Book on a visit to RCAHMS. Apparently many of the English and Welsh ones were destroyed in the war, but the Scottish ones survived. But that is a story for another time.