Friday, 8 April 2011

Excitement at the GIS tutorial

Last week I gave a GIS tutorial to Turi, Jayne, Phillip, Dave and Mark W, using some prototype data we cleaned up from the original KEPN, PAS and GUL data sets. The object was to help empower these blokes by showing them how to load the data into a GIS environment and chop it up with some simple querying methods, thus stimulating the construction of new research questions.

Talk about the wow factor - they were really chuffed to see a spatial plot of what they used to know as rows and rows of tabulated data. After filtering the data, suddenly their hypotheses were mapped out in front of them, e.g. place names with Cornish elements did gravitate to the county of Cornwall, and place names with Norse language elements did gravitate to the North and East of England. When ancillary data such as roads and rivers were plotted as background layers, I think the cogs and wheels started spinning, and ways to answer research questions were suddenly looking so much easier for the researchers.

The data was questioned though, and quite rightly so: it should be standard procedure for any researcher to be sure of the origins and quality of their data.

(i) Some of the grid references were slipping through our padding procedure (by this I mean our rounding up of grid references to 0.5km) and looking too accurate. We pad the references to ensure privacy of data and maintain a consistent resolution between datasets. This is a small technical issue we need to address.
(ii) Cornish place name elements were detected in Herefordshire and way up in Lancashire. Dave and I examined the original data source a few days later and found that these results faithfully reflected it. It was the original data that was throwing up the anomalies; technically the HALOGEN team appeared to get things right.
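
As a rough illustration of the padding in (i), here is a minimal Python sketch of rounding coordinates up to 0.5km and checking for references that slip through. It assumes the grid references are held as integer eastings and northings in metres; the function names, field names and sample values are made up for the example and are not the actual HALOGEN code or data.

```python
STEP_M = 500  # 0.5km expressed in metres (assumed unit)


def pad_coordinate(value_m: int, step_m: int = STEP_M) -> int:
    """Round a coordinate in metres up to the next multiple of step_m."""
    # -(-a // b) is integer ceiling division, so no floating point is involved.
    return -(-value_m // step_m) * step_m


def is_padded(value_m: int, step_m: int = STEP_M) -> bool:
    """True if the coordinate already sits on the step_m grid."""
    return value_m % step_m == 0


def find_unpadded(records, step_m: int = STEP_M):
    """Return the records whose easting or northing is finer than step_m."""
    return [
        r for r in records
        if not (is_padded(r["easting"], step_m) and is_padded(r["northing"], step_m))
    ]


# Hypothetical sample records, not real data.
records = [
    {"name": "example-1", "easting": 432500, "northing": 301000},
    {"name": "example-2", "easting": 432187, "northing": 301000},
]

leaks = find_unpadded(records)
for r in leaks:
    print(r["name"], "->", pad_coordinate(r["easting"]), pad_coordinate(r["northing"]))
```

A check like find_unpadded could be run over each data set before it goes out, so that any reference that is too accurate is caught rather than spotted in a tutorial.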

What do we learn from this? Firstly, all the hard work is paying off and the researchers find this a really useful tool. Secondly, we can only deal with the data we receive. We did our own quality check to be sure we had it right, but if the source data is wrong HALOGEN cannot 'make up' data that fits. A strategy of quality control on the original data is required.


Project Evaluation - ArcGIS Data Exchange Issues

We've hit a problem in exchanging data between the HALOGEN databases and ArcGIS. Up until now we've been exporting the data from the databases into files as tab-separated fields, importing into ArcGIS and converting into shapefile data from there. However, ArcGIS's shapefile format (or more specifically its associated .dbf database file) has some severe limitations which mean we can't realistically continue down this route.

ESRI has this to say about the use of shapefiles:

With some exceptions that are noted below, shapefiles are acceptable for storing simple feature geometry. However, shapefiles have serious problems with attributes. For example, they cannot store null values, they round up numbers, they have poor support for Unicode character strings, they do not allow field names longer than 10 characters, and they cannot store both a date and time in a field. These are just the main issues. Additionally, they do not support capabilities found in geodatabases such as domains and subtypes. So unless you have very simple attributes and no geodatabase capabilities, do not use shapefiles.

(see Geoprocessing Considerations for Shapefile Output).

Of these limitations, I believe the inability to represent NULL values is the most serious - NULLs are represented as zero within the shapefile's .dbf format, and this means that any value of zero within the data cannot be trusted at all.
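
To make the problem concrete, here is a small Python sketch of what happens when missing values are flattened to zero on export. It only mimics the behaviour described above; it does not use ArcGIS or read a real .dbf file, and the field names and rows are invented for the example.

```python
from typing import Optional


def flatten_null(value: Optional[int]) -> int:
    """Mimic the export behaviour described above: a missing value comes out as 0."""
    return 0 if value is None else value


# Hypothetical source rows as they might sit in the database.
source_rows = [
    {"id": 1, "count": 0},     # a genuine zero
    {"id": 2, "count": None},  # value not recorded (NULL)
    {"id": 3, "count": 7},
]

# What the same rows look like after the lossy export.
exported = [{"id": r["id"], "count": flatten_null(r["count"])} for r in source_rows]

# After export, rows 1 and 2 carry the same value and cannot be told apart.
ambiguous_ids = [r["id"] for r in exported if r["count"] == 0]
print(ambiguous_ids)
```

Once the export has happened there is no way to recover which of the zeros were real, which is why every zero in the exported data becomes suspect.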

I am looking at the ArcGIS Data Interoperability extension so that we can plug ArcGIS directly into the HALOGEN database and avoid these problems. Unfortunately there's no fully-featured solution for connecting ArcGIS to MySQL, so for evaluating this I'll be porting the HALOGEN databases to PostgreSQL.


Thursday, 7 April 2011

Project Update for March 2011

In terms of project progress we achieved less than we planned in March. One of the key development team members was unavailable because of a backlog of support work and the need to support another high priority project for researchers at the University (we are developing a hosting service for researchers).

The work on loading new data sources, addressing known problems with ArcGIS, and the Key to English Place Names data has all been impacted.

That said, the good news is that to address this problem (hopefully for the life of the project!) and to catch up lost time, the opportunity to recruit additional temporary resource has been taken, and Olly Butters joined the team on 4th April for 6 months.

Olly has been working on Physics and Astronomy projects and has good MySQL/PHP skills and a lot of experience of curating multiple/large data sets.

The 1881 Surname/Census data has been obtained and is being reviewed.

A discussion paper covering the tools evaluation has been issued for review.

A training workshop has been held for members of the Roots of the British research collaboration to allow them to get to grips with ArcGIS and HALOGEN data.

Requirements for the 'web enquiry' have been agreed and documented with the research user community.

Key achievements include: