Notes from EEO talk on population modelling with GIS

March 22, 2010

David Martin spoke in the EEO seminar series last Friday. Here are my notes:

In recent decades we have become “sophisticated in our tools, but our fundamental techniques and results aren’t very different”. Census data is not the same as demographic data, yet census approaches to modelling population have become dominant – a “long-term reliance on census-based shaded-area maps to inform spatial decision-making”.

Importance of small-area population mapping for policy – resource allocation and site location decisions, calculation of prevalence rates. “Who is present in a small area, and what characteristics do they have?” A house or flat becomes a “proxy” for a person, who is tied to the space.

This doesn’t give a clear picture of usage: the census captures night-time residence rather than daytime activity, which has very different patterns of repetition and variation of movement.

More general problems with census-taking –

  • underenumeration
  • infrequency
  • spatially concentrated error

“We could cut the city differently and produce variations in the pattern” – research in automated generation of census zones, looking for areas with social homogeneity, size, population, based on previous samplings.

“Population distribution is not space-filling but is quasi-continuous”.

“Interest in surfaces, grids and dasymetric approaches”. Using a grid to slice and visualise population data. The grid gives us a finer-grained depiction of actual activity.
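To make that concrete: a minimal sketch of binning point-based population counts into a regular grid with numpy – my illustration of the general idea, not Dr Martin’s algorithm, and the input file is hypothetical.

    import numpy as np

    # Hypothetical input: point locations (e.g. postcode centroids)
    # with a population count attached to each point.
    x, y, pop = np.loadtxt("postcode_centroids.csv", delimiter=",", unpack=True)

    # Sum the counts into a 100x100 grid; cells nobody falls in stay zero,
    # giving the quasi-continuous, non-space-filling picture quoted above.
    grid, xedges, yedges = np.histogram2d(x, y, bins=100, weights=pop)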

Interestingly, there has been a shift in government policy regarding census-taking. Rapid development of space, and new technology, cause problems – people are more mobile, with multiple bases, and concerns about data privacy are more mainstream.
The US Census Bureau has dropped the “long-form” return, which used to go to one in six recipients. In France the idea of a periodic census has been dropped completely; they now conduct a “rolling census” compiled from different data sources.

“Register-based sources” – e.g. demographic data is held by health services, local government, transport providers, business associations, communications companies. It’s possible to “produce something census-like”, but richer, by correlating these sources.

Also, the cross-section of other sources gives an idea of where census records are flawed and persistently inaccurate – e.g. council tax records not corresponding to where people claim they live.

Towards new representations of time-space

Temporal issues are still neglected by geodata specialists; in fact some of them are gnarlier and trickier than spatial representation.

The goal: space–time specific population surface modelling.

Dr Martin identified “emergent issues” affecting this practice – “spatial units, data sources as streams, representational concepts”. His group has some software in development to document the algorithm for gridding data space – I wanted to ask whether the software, and implicitly the algorithm, would be released as open source.

A thought about gridded data is that it’s straightforward to recombine (given that the grid cells for different sources are the same size). Something like OGC WCS, but much simpler.
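For instance, two population counts gridded to identical cells combine with a single array addition. A minimal numpy sketch of the idea (the filenames are placeholders):

    import numpy as np

    # Two population counts on the same grid: same cell size, same origin,
    # same shape. Filenames are hypothetical, for illustration only.
    census_grid = np.loadtxt("census_counts.grid")
    register_grid = np.loadtxt("register_counts.grid")

    # Recombination is just cell-wise arithmetic once the grids align.
    assert census_grid.shape == register_grid.shape, "grids must align"
    combined = census_grid + register_grid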


A very long list of census placenames

February 9, 2010

Nicola Farnworth from the UK Data Archive sent us a motherlode of user-contributed UK placenames – a list extracted from the 1881 census returns. The list is 910,096 lines long.

[Image: a corner of a page of a census record]

Many placenames have the name of a containing county attached, though some don’t. The data is full of errors: mistakes in the original records, mis-heard names, perhaps errors in transcription.

This census placename data badly needs a quality audit; how can Unlock Places help provide links to location references and clean up messy location data?

I made a start at this over the weekend, because I also wanted an excuse to play with the redis nosql data store.

To start, I threw the list of unique placenames against the geonames.org names in the Unlock Places API. The gazetteer is used to ground the placename list against known places; rather than searching for exact locations at this stage, we look for names that are known to exist as places. The search function I used, closestMatchSearch, does a fulltext search for very close matches. It took getting on for 36 hours to run the whole lot.
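In outline, the grounding pass looked something like the sketch below. This is a reconstruction, not the actual script: the endpoint URL and the JSON response shape are assumptions, and only the closestMatchSearch name comes from the API.

    import json
    import urllib.request
    from urllib.parse import urlencode

    # Assumed endpoint layout; check the Unlock Places documentation.
    BASE = "http://unlock.edina.ac.uk/ws/closestMatchSearch"

    def is_grounded(name):
        """True if the gazetteer knows this name as a place."""
        url = BASE + "?" + urlencode({"name": name, "format": "json"})
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)
        # Assumption: matches come back as a list of candidate features.
        return bool(result.get("features"))

    known, unknown = [], []
    for line in open("placenames.txt"):
        name = line.strip()
        (known if is_grounded(name) else unknown).append(name)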

unique placenames: 667513
known by geonames: 34180
unknown by geonames: 633333

We might hope for more, but this is a place to start. On manual inspection I noticed small settlements that are definitely in OpenStreetMap’s data. The Ordnance Survey 50K gazetteer, were it open data, would likely yield more initial matches.

Next, each of the unlocated placenames is compared to the grounded group of places; if one name is very similar to another (as measured by Levenshtein distance, with a handy python module) then a reference is stored recording that one place is the sameAs another.
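A rough sketch of that stage, assuming the python-Levenshtein module (the “handy python module” is unnamed here) and an assumed edit-distance threshold:

    import redis
    import Levenshtein  # pip install python-Levenshtein; module choice assumed

    r = redis.Redis()

    # Assumed threshold: edit distance of 2 or less counts as "very similar".
    THRESHOLD = 2

    def link_to_known(name, known_places):
        """Record a sameAs reference from an unlocated name to the
        closest grounded placename, if it is within the threshold."""
        best = min(known_places, key=lambda k: Levenshtein.distance(name, k))
        if Levenshtein.distance(name, best) <= THRESHOLD:
            # Mirrors the key layout shown below, e.g.
            # place:sameas:WELBOURN -> place:WELBURN
            r.set("place:sameas:" + name, "place:" + best)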

Based on the results of a test run, this string similarity test should yield at least 100,000 identities between placenames. It’s hard to say at this stage how many will be erroneous matches (Easton matching Aston) – perhaps 1 in 20, hopefully many fewer.

place:sameas:WELBOURN : place:WELBURN
place:sameas:WELBOURY : place:WELBURY
place:sameas:ALSHORNE : place:ASHORNE
place:sameas:PHURLIGH : place:PURLEIGH
place:sameas:LANGATHN : place:LLANGATHEN
place:sameas:WIGISTON : place:WIGSTON
place:sameas:ALSHORPE : place:ASHOPE
place:sameas:PELSCHAM : place:ELSHAM

As a next stage, I plan to run the similarity test again, on the placenames derived from the first stage, with a higher threshold for similarity.

This should start getting the placenames yet to be located down to a manageable few hundred thousand. I hope to run the remaining set against OpenStreetMap’s Nominatim geocoding search service. I should probably write to them and mention this.

There’s more to be done in cleaning and splitting the data. Some placenames are really addresses (which may well turn up through Nominatim); others are sub-regions or suburbs attached to other placenames, or carry north/south/east/west prefixes – see the sketch below.
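As an illustration of the splitting, something along these lines could peel off directional prefixes and trailing county qualifiers – the patterns are mine, not the project’s actual cleaning code.

    import re

    # Illustrative patterns only.
    DIRECTION = re.compile(r"^(NORTH|SOUTH|EAST|WEST)\s+", re.I)
    COUNTY = re.compile(r",\s*[A-Z ]+$")  # e.g. "WELBURY, YORKSHIRE"

    def core_name(raw):
        """Strip qualifiers so the core settlement name can be matched."""
        name = COUNTY.sub("", raw.strip())
        name = DIRECTION.sub("", name)
        return name

    # core_name("NORTH WELBURY, YORKSHIRE") -> "WELBURY"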

What next?

Ultimately there will be a large set of possible placenames, many tens of thousands, which aren’t reliably found in any gazetteer. How to address this?

A human annotator can be assisted by programs. We have a high threshold of acceptance for similarity of names for automatic link creation; we can lower that threshold a lot if a human is attesting to the result.

We can also look at sound-similarity algorithms like soundex and metaphone. There are concerns that these would have an unacceptable rate of false positives, but if a human annotator is intervening anyway, why not show rough-guess suggestions?
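A sketch of that suggestion step – the jellyfish library is my choice of module for soundex, not one named in these notes:

    from collections import defaultdict
    import jellyfish  # provides soundex and metaphone; module choice assumed

    def build_phonetic_index(known_places):
        """Index grounded placenames by their soundex key."""
        index = defaultdict(list)
        for place in known_places:
            index[jellyfish.soundex(place)].append(place)
        return index

    def suggest(name, index):
        """Rough-guess candidates for a human annotator to confirm or reject."""
        return index.get(jellyfish.soundex(name), [])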

A link back to the original source records would be of much benefit. Presumably the records come in sequences or sets which all deal with the same geographic region, more or less. By looking at clusters of placenames in a set of related documents, we can help pinpoint the location on a map (perhaps even pick out a name from a vector map layer).

Records with unknown placenames can be roughly located near the places of related records.

How close is close enough for search? If the record is floating near the street or the neighbourhood it belongs in, is that close enough?

And where people need micro-detail location and other annotations, how can they best provide their improvements for re-use by others?