Notes on Linked Data and Geodata Quality

March 15, 2010

This is a long post talking about geospatial data quality background before moving on to Linked Data about halfway through. I should probably try to break this down into smaller posts – “if I had more time, I would write less”.

Through EDINA’s involvement with the ESDIN project between national mapping and cadastral agencies (NMCAs) across Europe, I’ve picked up a bit about data quality theory (at least as it applies to geography). One of ESDIN’s goals is a common quality model for the network of cooperating NMCAs.

I’ve also been admiring Muki Haklay’s work on assessing the data quality of collaborative OpenStreetmap data against comparable national mapping agency data. His recent assessment of the OSM and Google MapMaker street maps of Haiti showed the benefit of analytical data quality work: it helps users assess how well the data they have matches the world, and assists with conflation, joining different spatial databases together.

Today I was pointed at Martijn Van Exel’s presentation at WhereCamp EU on “map quality”, which ends with a consideration of how to measure quality in OpenStreetmap. Are the map and the underlying data quite different things when we think about quality?

The ISO specs for data quality have their origins in industrial and military quality assurance – “acceptable lot quality” for samples from a production line. One measurement, “circular error probable”, comes from ballistics: the circle of error was once a literal circle drawn around successive shots from an automatic weapon, indicating how wide a spread between shots, and thus how much inaccuracy in the weapon, was tolerable.
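As a rough illustration – my own sketch, not anything taken from the ISO text – circular error probable can be estimated empirically as the median radial offset of observed positions from their reference positions, i.e. the radius within which half the observations fall:

import math

def circular_error_probable(observed, reference):
    """Estimate CEP as the median radial offset between observed points
    and their reference positions. Points are (x, y) tuples in the same
    planar units, e.g. metres."""
    offsets = sorted(
        math.hypot(ox - rx, oy - ry)
        for (ox, oy), (rx, ry) in zip(observed, reference)
    )
    mid = len(offsets) // 2
    # Median: half of the observations fall within this radius.
    if len(offsets) % 2:
        return offsets[mid]
    return (offsets[mid - 1] + offsets[mid]) / 2.0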

The ISO 19138 quality models apply to highly detailed data created by national mapping agencies. There’s a need for reproducible quality assessment of other kinds of data, less detailed and less complete, from both commercial and open sources.

The ISO model presents measures of “completeness” and “consistency”. For completeness, an object or an attribute of an object is either present, or not present.

Consistency is a bit more complicated than that. In the ISO model there are error elements, and error measures. The elements are different kinds of error – logical, temporal, positional and thematic. The measures describe how the errors should be reported – as a total count, as a relative rate for a given lot, as a “circular error probable”.
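To make the element/measure distinction concrete, here is a minimal sketch of how a single quality result might be recorded; the field names and example values are my own shorthand, not terms lifted from ISO 19138:

from collections import namedtuple

# The "element" is the kind of error; the "measure" is how it is reported.
QualityResult = namedtuple("QualityResult", "element measure value scope")

report = [
    QualityResult(element="logical", measure="error count",
                  value=12, scope="road network, sample of 500 features"),
    QualityResult(element="positional", measure="circular error probable",
                  value=4.5, scope="building outlines, full data set"),
]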

Geographic data quality in this formal sense can be measured, either by a full inspection of a data set or in samples from it, in several ways:

  • Comparing it to another data set, ideally of known and high quality.
  • Comparing the contents of the data set against rules describing what is expected.
  • Comparing samples of the data set to the world, e.g. by intensive surveying.
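As a very rough sketch of the first approach – a simplification of my own, not a method from any of the studies mentioned here – completeness can be compared by counting features per grid cell in the data set under test against a reference data set of known quality:

def completeness_by_cell(test_features, reference_features, cell_size=1000.0):
    """Compare feature counts per grid cell between a test data set and a
    reference data set. Features are (x, y) points in a projected coordinate
    system; cell_size is in the same units, e.g. metres. Returns a dict of
    {cell: test_count / reference_count} as a crude completeness ratio."""
    def counts(features):
        cells = {}
        for x, y in features:
            cell = (int(x // cell_size), int(y // cell_size))
            cells[cell] = cells.get(cell, 0) + 1
        return cells

    test_counts = counts(test_features)
    return {
        cell: test_counts.get(cell, 0) / float(ref_count)
        for cell, ref_count in counts(reference_features).items()
    }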

The ISO specs feature a data production process view of quality measurement. NMCAs apply rules and take measurements before publishing data or submitting it to cross-border efforts with neighbouring EU countries, and again after correcting the data to make sure roads join up. Practitioners definitely think in terms of spatial information as networks or graphs, not in terms of maps.

Collaborative Quality Mapping

Muki Haklay’s group used different comparison techniques – in one instance comparing variable-quality data to known high-quality data, in another comparing the relative completeness of two variable-quality data sources.

Less thought has gone into what data users need from quality information, as opposed to the clearer needs of data maintainers. Relatively few specialised users will benefit from knowing the rate of consistency errors versus topological errors; for most people this level of detail won’t provide the confidence needed to reuse the information. The fundamental question is “how good is good enough?”, and there is a wide spectrum of answers depending on the goals of each re-user of the data.

I also see several use cases for quality information that flags up data which is interesting for research or search purposes, but not appropriate for navigation or surveying, where errors can be costly.

An example: the “alpha shapes” that were produced by Flickr based on the distribution of geo-tagged images attached to a placename in a gazetteer.

Another example: polygon data produced by bleeding-edge auto-generalisation techniques that may have good results in some areas but bizarre errors in others.

Somewhat obviously, data quality information would be very useful to a data quality improvement drive. Geofabrik made the OSM Inspector tool, which highlights areas where nodes are disconnected or where names and feature types for shapes are missing.

Quality testing

What about quality testing? When I worked as a Perl programmer I enjoyed the test coverage and documentation coverage packages: a visual interface showing how much progress you’ve made on clearly documenting your code, and how many decisions that should be tested for integrity remain untested.

Software packages come with a set of tests – ideally these tests will have helped with the development process, as well as providing the user with examples of correct and efficient use of the code, and aiding in automatic installation of packages.

Donald Knuth promoted the idea of “literate programming”, where code fully explains what it is doing. For code, this concept can be extended to “literate testing” of how well software is doing what is expected of it.

At the Digimap 10th Birthday event, Glen Hart from Ordnance Survey Research talked about increasing data usability for Linked Data efforts. I want to connect this to the idea of “literate data”, and think about a data-driven approach to quality.

A registry based on CKAN, like data.gov.uk, could benefit from a quality audit. How can one take a quality approach to Linked Data?

To start with, each record has a set of attributes, and to reach completeness they should all be filled in – everything from the data license to maintainer contact information to the resource download link. Many records in CKAN.net are incomplete. Automated tests could be run on the presence or absence of properties for each package, and the results displayed on the web, with an option to view the relative quality of the package collections belonging to groups or tags. The process would help identify areas that need focus and follow-up, help to plan and follow progress on turning records into downloadable data packages, and reward groups that are diligent in maintaining metadata.
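A minimal sketch of what such a presence-or-absence test might look like, run over package metadata dictionaries; the field names here are illustrative guesses, not the actual CKAN schema:

# Fields one might expect a complete record to have (illustrative only).
REQUIRED_FIELDS = ["title", "license", "maintainer_email", "download_url", "notes"]

def completeness_errors(package):
    """Return the expected fields that are missing or empty in a package."""
    return [field for field in REQUIRED_FIELDS if not package.get(field)]

def completeness_summary(packages):
    """Fraction of fully complete packages, plus a tally of missing fields."""
    missing_tally = {}
    complete = 0
    for package in packages:
        errors = completeness_errors(package)
        if not errors:
            complete += 1
        for field in errors:
            missing_tally[field] = missing_tally.get(field, 0) + 1
    return complete / float(len(packages)), missing_tally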

The values of properties will have constraints, and these can be used to test for quality: links should be reachable, email contact addresses should get at least one response, locations in the dataset should be near the locations given in the metadata, time ranges should likewise match, and values that should be numbers actually are numbers.
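Several of these constraint checks need nothing more than the standard library. A rough sketch – the checks and the loose email pattern are my own examples:

import re
import urllib.request

def link_is_reachable(url, timeout=10):
    """True if the URL dereferences without an HTTP or network error."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (OSError, ValueError):
        return False

def looks_like_email(value):
    """Very loose check that a contact address is at least shaped like an email."""
    return bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", value or ""))

def is_number(value):
    """True if a value that should be numeric actually parses as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def within_bounding_box(point, bbox):
    """True if a (lon, lat) location in the data falls inside the
    (min_lon, min_lat, max_lon, max_lat) box claimed in the metadata."""
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat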

Some datasets listed in the data.gov.uk catalogues have URLs that don’t dereference, i.e. links that don’t work. It’s difficult to find out which packages these datasets are attached to, where to get the actual data, or how to contact the maintainers.

To see this in real data, visit the bare SPARQL endpoint at http://services.data.gov.uk/analytics/sparql and paste this query into the search box (it’s looking for everything described as a Dataset, using the scovo vocabulary for statistical data):

PREFIX scv: <http://purl.org/NET/scovo#>

SELECT DISTINCT ?p
WHERE {
  ?p a scv:Dataset .
}

The response shows a set of URIs which, when you try to look them up to get a full description, return a “Resource not found” error. The presence of a quality test suite would catch this kind of incompleteness early in the release schedule, and help provide metrics of how fast identified issues with incompleteness and inconsistency were being fixed.
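A sketch of how a test suite could catch this automatically: run the query against the endpoint, then try to dereference each URI and count the failures. The request format and JSON results handling assume the endpoint follows the standard SPARQL protocol, so treat this as illustrative rather than tested against the live service:

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://services.data.gov.uk/analytics/sparql"
QUERY = """
PREFIX scv: <http://purl.org/NET/scovo#>
SELECT DISTINCT ?p WHERE { ?p a scv:Dataset . }
"""

def dataset_uris():
    """Run the query and pull the ?p bindings out of the result set."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [row["p"]["value"] for row in results["results"]["bindings"]]

def broken_uris(uris):
    """Return the URIs that fail to dereference to a description."""
    broken = []
    for uri in uris:
        try:
            urllib.request.urlopen(uri, timeout=10)
        except OSError:
            broken.append(uri)
    return broken

if __name__ == "__main__":
    uris = dataset_uris()
    failures = broken_uris(uris)
    print(len(failures), "of", len(uris), "dataset URIs failed to dereference")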

The presence of more information about a resource, from a link, can be agreed on as a quality rule for Linked Data – it is one of the Four Principles after all, that one should be able to follow a link and get useful information.

With OWL schemas there is already some modelling of data objects, their attributes and their relations. There are rules languages from the W3C and elsewhere – RIF and SWRL – that could be used to automate some quality measurement, but these languages require a high level of buy-in to the standards, a rules engine, and expertise.

Data package testing can be viewed like software package testing. The rules are built up piece by piece, ideally growing as the code does. The methods used can be quite ad hoc, using different frameworks and structures, as long as the results are repeatable and the coverage is thorough.

Not everyone will have the time or patience to run quality tests on their local copy of the data before use, so we need some way to convey the results. This could be an overall score, or a count of completeness errors – something like the results of a software test run:

3 items had no tests...
9 tests in 4 items.
9 passed and 0 failed.
Test passed.
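A minimal sketch of producing that kind of summary from a pile of individual checks; the report format just mimics the test-runner output above and isn’t tied to any particular tool:

def run_checks(packages, checks):
    """Apply each (name, check_function) pair to each package and print a
    summary in the spirit of the test-runner output above."""
    passed = failed = 0
    for package in packages:
        for name, check in checks:
            if check(package):
                passed += 1
            else:
                failed += 1
                print("FAIL:", name, "on", package.get("name", "<unnamed>"))
    print(passed + failed, "tests run across", len(packages), "packages.")
    print(passed, "passed and", failed, "failed.")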

For quality improvement, one needs to see the detail of what is missing. Essentially this is a picture of a data model with missing pieces. It would look a bit like the content of a SPARQL query:

?dataset a scv:Dataset .
?dataset dc:title ?title .
?dataset scv:datasetOf ?package .
etc...

After writing this I was pointed at WIQA, a Linked Data quality specification language from the group behind DBpedia and Linked GeoData, which basically implements this idea with a SPARQL-like syntax. I would like to know more about in-the-wild use of WIQA and integration back into annotation tools…