Work in progress with OS Open Data

April 2, 2010

The April 1st release of many Ordnance Survey datasets as open data is great news for us at Unlock. As hoped for, Boundary-Line (administrative boundaries), the 50K gazetteer of placenames and a modified version of Code-Point (postal locations) are now open data.

Boundary-Line of Edinburgh shown on Google Earth. Contains Ordnance Survey data © Crown copyright and database right 2010

We’ll be putting these datasets into the open access part of Unlock Places, our place search service, and opening up Unlock Geocodes based on Code-Point Open. However, this is going to take a week or two, because we’re also adding some new features to Unlock’s search and results.

Currently, registered academic users are able to:

  • Grab shapes and bounding boxes in KML or GeoJSON – no need for GIS software, re-use in web applications
  • Search by bounding box and feature type as well as place name
  • See properties of shapes (area, perimeter, central point) useful for statistics visualisation

And soon we’ll be publishing these new features, currently in testing:

  • Relationships between places – cities, counties and regions containing found places – in the default results
  • Re-project points and shapes into different coordinate reference systems

These have been added so we can finally plug the Unlock Places search into EDINA’s Digimap service.

Having Boundary-Line shapes in our open data gazetteer will mean we can return bounding boxes or polygons through Unlock Text, which extracts placenames from documents and metadata. This will help to open up new research directions for our work with the Language Technology Group at Informatics in Edinburgh.

There are some organisations we’d love to collaborate with (almost next door, the Map Library at the National Library of Scotland and the Royal Commission on Ancient and Historical Monuments of Scotland) but have been unable to, because Unlock and its predecessor GeoCrossWalk were limited by license to academic use only. I look forward to seeing all the things the OS Open Data release has now made possible.

I’m also excited to see what re-use we and others could make of the Linked Data published by Ordnance Survey Research, and what their approach will be to connecting shapes to their administrative model.

MasterMap, the highest-detail OS dataset, wasn’t included in the open release. Academic subscribers to the Digimap Ordnance Survey Collection get access to places extracted from MasterMap, and improvements to other datasets created using MasterMap, with an Unlock Places API key.

Notes on Linked Data and Geodata Quality

March 15, 2010

This is a long post talking about geospatial data quality background before moving on to Linked Data about halfway. I should probably try to break this down into smaller posts – “if I had more time, I would write less”.

Through EDINA’s involvement with the ESDIN project between mapping and cadastral agencies (NMCAs) across Europe, I’ve picked up a bit about data quality theory (at least as it applies to geography). One of ESDIN’s goals is a common quality model for the network of cooperating NMCAs.

I’ve also been admiring Muki Haklay’s work on assessing data quality of collaborative OpenStreetMap data using comparable national mapping agency data. His recent assessment of OSM and Google MapMaker’s Haiti streetmaps showed the benefit of analytical data quality work, helping users assess how well what they have matches the world, and assisting with conflation to join different spatial databases together.

Today I was pointed at Martijn Van Exel’s presentation at WhereCamp EU on “map quality”, ending with a consideration of how to measure quality in OpenStreetMap. Are a map and its underlying data quite different when we think about quality?

The ISO specs for data quality have their origins in industrial and military quality assurance – “acceptable lot quality” for samples from a production line. One measurement, “circular error probable”, comes from ballistics design – the circle of error was once a literal circle round successive shots from an automatic weapon, indicating how wide a distance between shots, thus inaccuracy in the weapon, was tolerable.

The ISO 19138 quality models apply to highly detailed data created by national mapping agencies. There’s a need for reproducible quality assessment of other kinds of data, less detailed and less complete, from both commercial and open sources.

The ISO model presents measures of “completeness” and “consistency”. For completeness, an object or an attribute of an object is either present, or not present.

Consistency is a bit more complicated than that. In the ISO model there are error elements, and error measures. The elements are different kinds of error – logical, temporal, positional and thematic. The measures describe how the errors should be reported – as a total count, as a relative rate for a given lot, as a “circular error probable”.
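The three reporting styles mentioned above can be sketched in a few lines of Python. This is only an illustration, not the ISO definitions: the function names and sample values are made up, and circular error probable is approximated here as the median radial distance between measured and true positions.

```python
import math
from statistics import median

def error_count(flags):
    """Total count of items flagged as errors."""
    return sum(1 for flagged in flags if flagged)

def error_rate(flags):
    """Errors as a fraction of the inspected lot."""
    return error_count(flags) / len(flags)

def circular_error_probable(offsets):
    """Radius that contains half of the positional errors.

    offsets is a list of (dx, dy) differences between measured and
    true positions, in whatever unit the survey uses.
    """
    return median(math.hypot(dx, dy) for dx, dy in offsets)

# Hypothetical inspection results for a lot of eight features.
flags = [False, True, False, False, True, False, False, False]
# Hypothetical positional offsets for five checked points.
offsets = [(3, 4), (0, 1), (6, 8), (0, 2), (5, 12)]

print(error_count(flags), error_rate(flags), circular_error_probable(offsets))
```

A count is easy to read for one lot, a rate lets lots of different sizes be compared, and the radius says something about positional error that neither of the other two can.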

Geographic data quality in this formal sense can be measured, either by a full inspection of a data set or in samples from it, in several ways:

  • Comparison to another data set, ideally of known and high quality
  • Comparing the contents of the dataset, using rules to describe what is expected.
  • Comparing samples of the dataset to the world, e.g. by intensive surveying.

The ISO specs feature a data production process view of quality measurement. NMCAs apply rules and take measurements before publishing data, submitting data to cross-border efforts with neighbouring EU countries, and later after correcting the data to make sure roads join up. Practitioners definitely think in terms of spatial information as networks or graphs, not in terms of maps.

Collaborative Quality Mapping

Muki Haklay’s group used different comparison techniques – in one instance comparing variable-quality data to known high-quality data, in another comparing the relative completeness of two variable-quality data sources.

Not so much thought has gone into the data user’s needs from quality information, as opposed to the data maintainer’s clearer needs. Relatively few specialised users will benefit from knowing the rate of consistency errors vs topological errors – for most people this level of detail won’t provide the confidence needed to reuse the information. The fundamental question is “how good is good enough?” and there is a wide spectrum of answers depending on the goals of each re-user of data.

I also see several use cases for use of quality information to flag up data which is interesting for research or search purposes, but not appropriate to use for navigation or surveying purposes, where errors can be costly.

An example: the “alpha shapes” that were produced by Flickr based on the distribution of geo-tagged images attached to a placename in a gazetteer.

Another example: polygon data produced by bleeding-edge auto-generalisation techniques that may have good results in some areas but bizarre errors in others.

Somewhat obviously, data quality information would be very useful to a data quality improvement drive. GeoFabrik made the OpenStreetMap Inspector tool, highlighting areas where nodes are disconnected or names and feature types for shapes are missing.

Quality testing

What about quality testing? When I worked as a Perl programmer I enjoyed the test coverage and documentation coverage packages: a visual interface to show how much progress you’ve made on clearly documenting your code, and how many decisions that should be tested for integrity remain untested.

Software packages come with a set of tests – ideally these tests will have helped with the development process, as well as providing the user with examples of correct and efficient use of the code, and aiding in automatic installation of packages.

Donald Knuth promoted the idea of “literate programming”, where code fully explains what it is doing. For code, this concept can be extended to “literate testing” of how well software is doing what is expected of it.

At the Digimap 10th Birthday event, Glen Hart from Ordnance Survey Research talked about increasing data usability for Linked Data efforts. I want to link to this the idea of “literate data”, and think about a data-driven approach to quality.

A registry based on CKAN, like data.gov.uk, could benefit from a quality audit. How can one take a quality approach to Linked Data?

To start with, each record has a set of attributes and to reach completeness they should all be filled in. This ranges from data license to maintainer contact information to resource download. Many records in CKAN.net are incomplete. Automated tests could be run on the presence or absence of properties for each package. The results can be displayed on the web, with the option to view the relative quality of package collections belonging to groups or tags. The process would help identify areas that needed focus and followup. It would help to plan and follow progress on turning records into downloadable data packages. Quality testing could help reward groups that were being diligent in maintaining metadata.

The values of properties will have constraints, and these can be used to test for quality – links should be reachable, and email contact addresses should yield at least one response. Locations in the dataset should be near locations in the metadata. Time ranges matching, ditto. Values that should be numbers, actually are numbers.
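Here is a minimal sketch of what such presence and constraint tests might look like in Python. The field names and sample records are hypothetical, not CKAN’s actual schema, and the checks only look at the form of a value; testing that a link is reachable or that an address answers would need network access on top of this.

```python
import re

# Hypothetical required fields; a real registry would define its own list.
REQUIRED = ["title", "license", "maintainer_email", "download_url"]

def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED:
        if not record.get(field):
            problems.append(f"missing {field}")
    email = record.get("maintainer_email")
    if email and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        problems.append("maintainer_email is not a plausible address")
    url = record.get("download_url")
    if url and not url.startswith(("http://", "https://")):
        problems.append("download_url is not an http(s) link")
    return problems

def audit(records):
    """Map each record name to its problems, keeping only failing records."""
    report = {}
    for name, record in records.items():
        problems = check_record(record)
        if problems:
            report[name] = problems
    return report

# Two made-up package records: one complete, one with gaps.
records = {
    "roads": {"title": "Roads", "license": "cc-by",
              "maintainer_email": "maps@example.org",
              "download_url": "http://example.org/roads.zip"},
    "rivers": {"title": "Rivers", "maintainer_email": "nobody",
               "download_url": "ftp://example.org/rivers.zip"},
}
report = audit(records)
for name, problems in report.items():
    print(name, problems)
```

Keeping each rule as a small independent check means the list can grow piece by piece, and the per-record problem list doubles as the detail a maintainer needs to fix things.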

Some datasets listed in the data.gov.uk catalogues have URLs that don’t dereference, i.e. are links that don’t work. It’s difficult to find out what packages these datasets are attached to, where to get the actual data or contact the maintainers.

To see this in real data, visit the bare SPARQL endpoint at http://services.data.gov.uk/analytics/sparql and paste this query into the search box (it’s looking for everything described as a Dataset, using the scovo vocabulary for statistical data):

PREFIX scv: <http://purl.org/NET/scovo#>

SELECT DISTINCT ?p
WHERE {
?p a scv:Dataset .
}

The response shows a set of URIs which, when you try to look them up to get a full description, return a “Resource not found” error. The presence of a quality test suite would catch this kind of incompleteness early in the release schedule, and help provide metrics of how fast identified issues with incompleteness and inconsistency were being fixed.

The presence of more information about a resource, from a link, can be agreed on as a quality rule for Linked Data – it is one of the Four Principles after all, that one should be able to follow a link and get useful information.

With OWL schemas there is already some modelling of data objects and attributes and their relations. There are rules languages from W3C and elsewhere that could be used to automate some quality measurement – RIF and SWRL. These languages require a high level of buy-in to the standards, a rules engine, expertise.

Data package testing can be viewed like software package testing. The rules are built up, piece by piece, growing as the code does, ideally. The methods used can be quite ad-hoc, use different frameworks and structures, as long as the results are repeatable and the coverage is thorough.

Not everyone will have the time or patience to run quality tests on their local copy of the data before use, so we need some way to convey the results. This could be an overall score, a count of completeness errors – something like the results of a software test run:

3 items had no tests...
9 tests in 4 items.
9 passed and 0 failed.
Test passed.

For quality improvement, one needs to see the detail of what is missing. Essentially this is a picture of a data model with missing pieces. It would look a bit like the content of a SPARQL query:

a scv:Dataset .
dc:title ?title .
scv:datasetOf ?package .
etc...

After writing this I was pointed at WIQA, a Linked Data quality specification language by the group behind dbpedia and Linked GeoData, which basically implements this with a SPARQL-like syntax. I would like to know more about in-the-wild use of WIQA and integration back into annotation tools…


Dev8D: JISC Developer Days

March 5, 2010

The Unlock development team recently attended the Dev8D: JISC Developer Days conference at University College London. The format of the event is fairly loose, with multiple sessions in parallel and the programme created dynamically as the 4 days progressed. Delegates are encouraged to use their feet to seek out what interests them! The idea is simple: developers, mainly (but not exclusively) from academic organisations come together to share ideas, work together and strengthen professional and social connections.

A series of back-to-back 15 minute ‘lightning talks’ ran throughout the conference; I delivered two, describing EDINA’s Unlock services and showing users how to get started with the Unlock Places APIs. Discussions after the talks focused on the question of open sourcing and the licensing of Unlock Places software generally – and what future open gazetteer data sources we plan to include.

In parallel with the lightning talks, workshop sessions were held on a variety of topics such as linked data, iPhone application development, working with Arduino and the Google App Engine.

Competitions
Throughout Dev8D, several competitions or ‘bounties’ were held around different themes. In our competition, delegates had the chance to win a £200 Amazon voucher by entering a prototype application making use of the Unlock Places API. The most innovative and useful application wins!

I gave a quick announcement at the start of the week to discuss the competition and how to get started using the API, and then demonstrated a mobile client for the Unlock Places gazetteer as an example of the sort of competition entry we were looking for. This application makes use of the new HTML5 web database functionality – enabling users to download and store Unlock’s feature data offline on a mobile device. Here are some of the entries:

Marcus Ramsden from Southampton University created a plugin for EPrints, the open access repository software. Using the Unlock Text geoparser, ‘GeoPrints’ extracts locations from documents uploaded to EPrints then provides a mechanism to browse EPrint documents using maps.

Aidan Slingsby from City University entered some beautiful work displaying point data (in this case a gazetteer of British placenames) shown as tag-maps, density estimation surfaces and chi surfaces rather than the usual map-pins! The data was based on GeoNames data accessed through the Unlock Places API.

And the winner was… Duncan Davidson from Informatics Ventures, University of Edinburgh. He used the Unlock Places APIs together with Yahoo Pipes to present data on new start-ups and projects around Scotland. Enabling the conversion of data containing local council names into footprints, Unlock Places allowed the data to be mapped using KML and Google Maps, enabling his users to navigate around the data using maps – and search the data using spatial constraints.

Some other interesting items at Dev8D…

  • <sameAs>
    Hugh Glaser from the University of Southampton discussed how sameAs.org works to establish linkage between datasets by managing multiple URIs for Linked Data without an authority. Hugh demonstrated using sameAs.org to locate co-references between different data sets.
  • Mendeley
    Mendeley is a research network built around the same principle as last.fm. Jan Reichelt and Ben Dowling discussed how by tracking, sharing and organising journal/article history, Mendeley is designed to help users to discover and keep in touch with similarly minded researchers. I heard of Mendeley last year and was surprised by the large (and rapidly increasing) user base – the collective data from its users is already proving a very powerful resource.
  • Processing
    Need to do rapid visualisation of images, animations or interactions? Processing is a Java-based sketchbook/IDE which will help you to visualise your data much more quickly. Ross McFarlane from the University of Liverpool gave a quick tutorial of Processing.js, a JavaScript port using <Canvas>, illustrating the power and versatility of this library.
  • Genetic Programming
    This session centred around some basic aspects of Genetic Algorithms/Evolutionary Computing and Emergent properties of evolutionary systems. Delegates focused on creating virtual ants (with Python) to solve mazes and by visualising their creatures with Processing (above), Richard Jones enabled developers to work on something a bit different!
  • Web Security
    Ben Charlton from the University of Kent delivered an excellent walk-through of the most significant and very common threats to web applications. Working from the OWASP Top 10 project, he discussed each threat with real world examples. Great stuff – important for all developers to see.
  • Replicating 3D Printer: RepRap
    Adrian Bowyer demonstrated RepRap – short for Replicating Rapid-prototyper. It’s an open source (GPL) device, able to create robust 3D plastic components (including around half of its own components). Its novel capability of being able to self-copy, with material costs of only €350, makes it accessible to small communities in the developing world as well as individuals in the developed world. His inspiring talk was well received and this super illustration of open information’s far reaching implications captured everyone’s imagination.

All in all, a great conference. A broad spread of topics, with the right mix of sit-and-listen to get-involved activities. Whilst Dev8D is a fairly chaotic event, it’s clear that it generates a wealth of great ideas, contacts and even new products and services for academia. See Dev8D’s Happy Stories page for a record of some of the outcomes. I’m now looking forward to seeing how some of the prototypes evolve and I’m definitely looking forward to Dev8D 2011.


A very long list of census placenames

February 9, 2010

Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910096 lines long.

A corner of a page of a census record

Many placenames have the name of a containing county, though some don’t. The data is full of errors, mistakes in the original records, mis-heard names, maybe errors in transcription.

This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?

I made a start at this over the weekend, because I also wanted an excuse to play with the redis nosql data store.

To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places: rather than search for exact locations at this stage, we look for known-to-exist-as-place names. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.

unique placenames: 667513
known by geonames: 34180
unknown by geonames: 633333

We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetMap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.

Next, each of the unlocated placenames is compared to the grounded group of places, and if one name is very similar to another (as measured by Levenshtein distance with a handy python module) then a reference is stored that one place is the sameAs another.

Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. Hard to say at this stage how many will be in some kind of error (Easton matching Aston), 1 in 20 or hopefully many fewer.

place:sameas:WELBOURN : place:WELBURN
place:sameas:WELBOURY : place:WELBURY
place:sameas:ALSHORNE : place:ASHORNE
place:sameas:PHURLIGH : place:PURLEIGH
place:sameas:LANGATHN : place:LLANGATHEN
place:sameas:WIGISTON : place:WIGSTON
place:sameas:ALSHORPE : place:ASHOPE
place:sameas:PELSCHAM : place:ELSHAM
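For illustration, the matching step might be sketched like this in plain Python. It is a stand-in rather than the code that produced the listing above: that run used an existing Python module and redis, whereas here the distance function is written out by hand, the results go into a dict, and the max_distance threshold of 2 is an assumption. Only the key format is borrowed from the listing.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def same_as(unknown, grounded, max_distance=2):
    """Link each unknown name to its closest grounded name, if close enough."""
    links = {}
    for name in unknown:
        best = min(grounded, key=lambda g: levenshtein(name, g))
        if levenshtein(name, best) <= max_distance:
            links["place:sameas:" + name] = "place:" + best
    return links

grounded = ["WELBURN", "WIGSTON", "PURLEIGH"]
links = same_as(["WELBOURN", "WIGISTON", "PHURLIGH", "ZZZZ"], grounded)
for key, value in links.items():
    print(key, ":", value)
```

Comparing every unlocated name with every grounded name like this is brute force, so on a list of this size it would be slow; the sketch only shows the shape of the test, not a way to run it at scale.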

As a next stage, I plan to run the similarity test again, on the placenames derived from it in the first stage, with a higher threshold for similarity.

This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetMap’s Nominatim geocoding search service. I should probably write to them and mention this.

There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim); others are sub-regions or suburbs attached to other placenames, or carry north/south/east/west prefixes.

What next?

Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?

A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.

We can also look at sound similarity algorithms like soundex and metaphone. There are concerns that this would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
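As a rough idea of what a sound-based suggestion could look like, here is a small Soundex sketch in Python, following the common American Soundex rules (first letter kept, consonants mapped to digits, code padded to four characters). It is an illustration and has not been run against the census list.

```python
# Build the consonant-to-digit table used by American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    """Four-character Soundex code for a name (non-letters are ignored)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not break a run of equal codes
            last = digit
    return (code + "000")[:4]

for a, b in [("WELBOURN", "WELBURN"), ("PHURLIGH", "PURLEIGH"), ("EASTON", "ASTON")]:
    print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Because the first letter is kept as-is, names that differ only at the start get different codes, so this would miss some mis-heard names while still grouping many unrelated ones. That fits the idea of offering rough guesses to a human rather than creating links automatically.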

A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).

Records with unknown placenames can be roughly located near the places of related records.
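One very simple reading of "roughly located" is the average position of the places already found in the same set of records. The sketch below assumes coordinates are (longitude, latitude) pairs in degrees; the function name and sample points are made up for illustration.

```python
def rough_location(related_points):
    """Mean longitude/latitude of located places in the same record set.

    Averaging degrees is only reasonable for points that are close
    together and away from the poles and the 180 degree meridian.
    """
    if not related_points:
        return None
    lons = [lon for lon, lat in related_points]
    lats = [lat for lon, lat in related_points]
    return (sum(lons) / len(lons), sum(lats) / len(lats))

# Two hypothetical located places from the same batch of records.
print(rough_location([(-3.0, 56.0), (-3.5, 55.0)]))
```

A guess like this should be stored with a flag saying it is an estimate, so that it can be used for search but not mistaken for a surveyed position.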

How close is close enough for search? If the record is floating near the street, or the neighbourhood, that it belongs in, is that close enough?

And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?


Places you won’t find in any dictionary

January 12, 2010

Tobar an Dualchais is an amazing archive of Gaelic and Scots speech and song samples. Under the hood, each of their records is annotated with places – the names of the village, or island, or parish, where the speaker came from.

We’ve been trying to Unlock their placename data, so the names can be given map coordinates, and the recordings searched by location. Also, I wanted to see how much difference it would make if the Ordnance Survey 50K gazetteer were open licensed, thus enabling us to use it for this (non-research) project.

Out of 1628 placenames, we found 851 exact matches in the 50K gazetteer and 1031 in the geonames.org gazetteer. Just 90 placenames were in the 50K but not in geonames. There’s a group of 296 placenames that we couldn’t find in any of our gazetteer data sources. Note that this is an unusual sample, focused on remote and infrequently surveyed places in the Highlands and Islands, but I had hoped for more from the 50K coverage.

There are quite a few fun reasons why there are so many placenames that you won’t find in any dictionary:

  • Places that are historic don’t appear in our contemporary OS sources. Many administrative areas in Scotland changed in 1974, and current OS data does not have the old names or boundaries. Geonames has some locations for historic places (e.g. approximate centroids for the old counties) though without time ranges.
  • Typographical errors in data entry. E.g. “Stornooway” and “Stornaway” – using the gazetteer web service at the content creation stage would help with this.
  • Listings for places that are too small to be in a mid-scale gazetteer. For example, TAD data includes placenames for buildings belonging to clubs and societies where Gaelic sound recordings were made. Likely enough, some small settlements have escaped the notice of surveyors for OS and contributors to geonames.
  • Some places exist socially but not administratively. For example, our MasterMap gazetteer has records for a “Clanyard Bay”, “Clanyard House”, “Clanyard Mill” but not Clanyard itself. The Gazetteer for Scotland describes Clanyard as “a locality, made up of settlements” – High, Low and Middle Clanyards.
  • Geonames has local variant spellings as alternative names, and these show up in our gazetteer search, returning the more “authoritative” name.
  • Limitations in automated search for descriptions of names. For example, some placenames look like Terregles (DFS) see also Kirkcudbrightshire. I’m hoping the new work on fulltext search will help to address this – but there will always need to be a human confirmation stage, and fixes to the original records.

It’s been invaluable to have a big set of known-to-be-placenames contributed in free-text fields by people who aren’t geographers. I would like to do more of this.

I saw a beautiful transcript of an Ordnance Survey Object Name Book on a visit to RCAHMS. Apparently many of the English and Welsh ones were destroyed in the war, but the Scottish ones survived. But that is a story for another time.


Linked Data impact and long-term URL preservation

December 21, 2009

After a quick pass through Paul Miller’s draft Linked Data report for JISC, I looked out the notes I had made when we talked in the Black Medicine cafe. There were unusually few notes, for quite a long conversation.

I don’t think we really discussed anything that featured in the draft Linked Data report; not the implementation issues. We talked about the broader implications of linked open data for JISC services, about business models for support of open data, about the upcoming effort on data.gov.uk …

One topic I did take notes on was that of long-term URL preservation – what kind of institution to approach to make a commitment to keep a URL around for 30+ years for the use of, say, a library special collection georeferencing project (and hopefully many others).

Here is an edit to a set of notes I wrote for @simonjbains and others at the Digital Library in Edinburgh. It’s likely this requirement is not unique to geo-services; bibliography and media archive projects would surely face similar needs to make sure that references really stick around.

RCAHMS was another interesting choice given their involvement in digital gazetteer reference already with Scotland’s Places.