Wednesday 30 November 2011

JISC GEO Event – Break Out Session – Repurposing Geospatial Data

A small but passionate group (Shawn Day – Royal Irish Academy, Andrew Bradley – University of Leicester and Dave Carter – University of Leicester) met at Table 3 and discussed this topic.

All parties were advocates for, and very much in favour of, the reuse/repurposing of research data. There are big opportunities for efficiencies and cost savings in terms of data creation/collection, curation and management.

That said, researchers should not assume that repurposing is necessarily ‘straightforward’ or ‘easy’ at present. Researchers need to be mindful of a number of issues.

A number of points were discussed – the order below does not suggest a hierarchy or priority order.

How do you find what’s available? – In England there are some national, subject specific and institutional repositories for data but it is not always easy to find out what is available and from where. In Ireland there is a real absence of these types of repositories – where do you go to get data?

Permissions/Licenses – You need to make sure that you can use the data you have found for your specific purpose. Depending on the type of license, this could mean finding and checking with the ‘real’ data owners.

Ethical issues – In the case of the HALOGEN project we repurposed/reused genetic data. There were sensitivities around this as the personal data had been collected for a specific purpose and not for the purpose we had in mind. We had to apply for/receive approval from the University of Leicester’s Ethics Committee. Is it possible that we were looked on favourably as we were employees of the University? Politics potentially plays a part.

Standards – Specifically relating to geospatial shapefile formats, there is a strong bias towards commercial standards, specifically ESRI's. There are ‘open’ alternatives but are they really widely used?

Can you ‘understand’ the structure and assumptions of the data you have obtained? – There are many factors that help/hinder this. As a researcher if you want to have confidence in your results then you must be confident you understand the data on which they are based. If you’re repurposing data that logic still applies. You will need to have access to good quality documentation in terms of data glossaries, data models and data dictionaries.

Metadata standards are crucial.

Ideally you want to be able to talk ideas and issues through with the original creator or someone who is close to the data.

For many data sources, access to the above is very limited and the quality of documentation is variable.

You really do need to understand the provenance of the data you are about to use. You need to undertake a risk assessment and think about all the issues up front – if you aren’t comfortable then you don’t use it!

We feel repository managers/data creators have a key role here as to some extent they need to be able to ‘guarantee’ the data they hold. The idea of ‘kite marks’ was discussed.

Are there automated tools that could be used to validate the structure/completeness of data sets (XML checkers were discussed)?
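As a rough illustration of the kind of automated check we had in mind, the sketch below tests that a metadata record is well-formed XML and that a few fields are present and non-empty. It is only a sketch: the field names (title, creator, licence) are hypothetical and not taken from any particular metadata standard, and a real checker would validate against the schema a repository actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of fields a repository might insist on.
REQUIRED_FIELDS = ("title", "creator", "licence")


def check_metadata(xml_text):
    """Return a list of problems found in one XML metadata record."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Structure check: the record is not even well-formed XML.
        return [f"not well-formed XML: {err}"]

    problems = []
    for field in REQUIRED_FIELDS:
        element = root.find(field)
        if element is None:
            problems.append(f"missing <{field}>")
        elif not (element.text or "").strip():
            # Completeness check: the element exists but holds no text.
            problems.append(f"empty <{field}>")
    return problems
```

An empty list means the record passed these (very basic) checks; anything else is a list of human-readable problems that could be reported back to the depositor.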

Forcing Researchers to Deposit Data – We feel one of the biggest barriers at present is the fact that researchers don’t necessarily have to make data available. Funders/funding bodies have a key role here as they should make the deposit of data in a ‘repository’ mandatory and specify the terms of any associated licenses.

… and from the above point comes our ‘Number 1’ recommendation – Funding bodies need to stipulate up front as part of any award that data derived as a result of the research they sponsor must be made available in a format that supports reuse and repurposing. This data should be deposited in either an institutional or subject specific repository. Exceptions to this should be allowed but they should be just that – justified exceptions. If each funding body maintained an index of the data sets they had sponsored one of the key barriers (access - in terms of finding out what is available!) would in part be overcome.

Over to you ………..

Thursday 10 November 2011

HALOGEN 2 Project - Final Product Post

This final blog post provides information on the primary product, a geospatial visualisation tool called HALO-view for researchers, produced by the HALOGEN 2 project.

Key Elements of HALOGEN2

The HALOGEN 2 project set out with two key objectives which should provide some context for the HALO-view tool we have produced.

Firstly, to develop and deliver better tools to researchers so that people with different levels of technical skill could access and get value from the HALOGEN datasets. HALOGEN (History, Archaeology, Linguistics, Onomastics and GENetics) is a cross-disciplinary spatial research database established in 2010 (www.le.ac.uk/halogen). The database holds data for England from the British Museum’s Portable Antiquities Scheme, place name data from the University of Nottingham’s Institute for Name Studies and genetics data from the University of Leicester.

To do this we wanted to create a geospatial visualisation tool. We also developed a prototype data extraction tool using Business Objects (http://www.sap.com/uk/solutions/sapbusinessobjects/index.epx) to give researchers an easy way of accessing HALOGEN data and extracting subsets of data for statistical analysis with tools that they are already familiar with, such as SPSS, SAS and Excel.

Secondly, we wanted both to extend the coverage of existing data sources and to add new data sources to the HALOGEN database. This involved acquiring, cleaning, formatting and ingesting two new sources of data into the HALOGEN database. The first of these is data on surname distributions based on the 1881 census [1], and the second is additional genetics data from a previously published study [2]. In addition, the existing Portable Antiquities Scheme data was extended to cover the whole of England.

HALO-view Visualisation Tool

HALO-view - a spatial visualisation tool: simplifying and improving access to spatial data

The tool can be accessed at http://halogen.le.ac.uk/. Please email dpc15@le.ac.uk with any feedback.

The HALO-view tool allows non-technical users to query HALOGEN datasets through a simple web-based interface with the results displaying on a Google map.


When you begin using the visualisation tool it will automatically detect your location and display Place Name data associated with your current place name. From this first screen you can run a query to:

  • Discover the meaning of your own chosen place name.
  • View a whole county of place names.
  • Interrogate the entire country by language type and language element.
  • Search for the ‘treasure’ of England in the Portable Antiquities database using county or historical period.
  • Query parish, county or country-wide surname data from the Victorian census in 1881.
  • Explore summarised genetics data by county or counties, and for the specialist by genetic haplogroups.
  • View additional soil and Roman road map overlays to stimulate thinking on patterns and coincidences between the HALOGEN datasets.
For each data set there are a variety of ways to navigate the data: you can choose to explore either exact matches or use “fuzzy searches”; you can view the output either in mapping form or as a flexible tabular output; and you can also zoom in and out of mapped data and switch between satellite and map views.

Searching for English Place-Names

This screenshot of HALO-view shows a search for English Place-Names



If you ‘zoom in’ you can see a lower level distribution of points and by clicking on a ‘point’ you display its name and data on its derivation.




Looking for Treasure!

This screenshot shows a view of the ‘treasure’ of England in the Portable Antiquities database using county or historical period.


Zooming in and clicking on a find displays further information.


Querying parish, county or country-wide surname data from the Victorian census in 1881

This screenshot shows the 1881 Census Surname overview


This screenshot shows an example of tabular output relating to the distribution of the Butters surname in Northamptonshire, Nottinghamshire, Leicestershire and Derbyshire.


Soil and Roman road map
The example below shows the distribution of Roman finds against the Roman road overlay.



The only way to get a feel for the tool is to have a go using it.
GO ON GIVE IT A SPIN!

Who Is HALOGEN For?

Initially the project was targeted at two specific groups of researchers at the University of Leicester. The first group was the cross-disciplinary ‘Roots of the British’ collaboration (www2.le.ac.uk/projects/roots-of-the-british), a group of scholars grounded in humanities and genetics. Their mission is to interrogate evidence for the migration and/or continuity of human populations in the British Isles. The second group is from the ‘Impact of Diasporas’ Project (http://www2.le.ac.uk/projects/impact-of-diasporas) who plan to analyse and model relevant migration data. This tool is seen as a key facility for their data analysis work.

Quote from Professor Mark Jobling (Roots of the British Collaboration): "This project has been a very positive experience for us. It was efficiently managed, flexible (accommodating the introduction of new expertise as it went along), and has come up with a useful product that we will continue to develop".

"The novelty and multidisciplinary nature of the project has contributed to the success of other multidisciplinary grant applications, and in turn these will feed back into further developing, and sustaining the HALOGEN resource."

In addition, the IT expertise, modernisation and quality control of old database structures have also widened the audience to include specific user groups. For example, the project will be replacing the existing web enquiry facilities for the Institute for Name Studies ‘Key to English Place Names’ database run from the University of Nottingham. This will be used by researchers working at and with the Institute and by members of the general public who are interested in place-name etymology and who access the website out of general interest.

Quote from Jayne Carroll, Director of Institute for Name Studies: ‘HALOGEN has not only incorporated KEPN into a larger dataset with interesting and potentially significant results, it has added functionality to KEPN as a stand-alone research tool, allowing finer-grained searches and a range of map interfaces, altogether improving on the original.’

What Are Our Future Plans?
It’s the view of the project team, and evidenced in the quotes from our researchers above, that the HALOGEN and HALO-view systems have huge potential. Ideas on improving the products are many and include:
  • Increasing the number of datasets included in the database and available through HALO-view. Discussions are underway at Leicester to consider adding further genetics data that could support genome and population researchers.
  • Improving HALO-view so that it allows users to run queries across multiple datasets at the same time as an aid to easily exploring relationships and patterns between different types of data.
  • Developing and offering a ‘service’ to researchers who do not have access to the level of technical expertise available at Leicester. The idea is that, in return for a researcher sharing their data, we would clean it, apply spatial data, ingest it into HALOGEN and make their data available back to them through the HALO-view tool. They would then be able to visualise their data against other datasets, and in return another source of data would be made available to our community.
To address the above, funding is required. We feel HALOGEN and HALO-view provide a model for researchers looking to explore multiple complementary geographically referenced data sets. There is an opportunity for other researchers to reuse our approach and tools, as well as learning lessons from our work as they work on projects in their own institutions and disciplines.

Licensing

The outputs from the project will be backed up, managed and supported by IT Services at the University of Leicester for at least 3 years.

All project documentation, whatever its form, is governed by a license agreement complying with Creative Commons Attribution-NonCommercial-ShareAlike 3.0, and all code is licensed under terms compliant with the GNU General Public License version 3.0. The code will be made publicly available at: https://svn.rcs.le.ac.uk/public/halogen/HALOview

References and Documentation

These links provide access to information on the HALOGEN2 data sets and how to use the HALO-view tool:

  • [1] Schürer, K. and Woollard, M., 1881 Census for England and Wales, the Channel Islands and the Isle of Man (Enhanced Version) [computer file]. Genealogical Society of Utah, Federation of Family History Societies, [original data producer(s)]. Colchester, Essex: UK Data Archive [distributor], November 2000. SN: 4177, http://dx.doi.org/10.5255/UKDA-SN-4177-1
  • [2] Capelli, C., Redhead, N., Abernethy, J. K., Gratrix, F., Wilson, J. F., Moen, T., Hervig, T., Richards, M., Stumpf, M. P. H., Underhill, P. A., Bradshaw, P., Shaha, A., Thomas, M. G., Bradman, N. & Goldstein, D. B., A Y Chromosome census of the British Isles (2003). Current Biology 13, 979-984.
  • User guide for visualisation tool: http://halogen.le.ac.uk/guide/
  • Data Glossary for HALOGEN database: http://www2.le.ac.uk/offices/itservices/resources/cs/pso/project-websites/halogen/documents/Data-glossary-V2.3.pdf
  • Technical Guide for Visualisation Tool: (In preparation - link to be added)

Acknowledgements
The team gratefully acknowledge the support and funding provided by JISC, without which this work would not have been possible.



We would also like to thank our partners and data providers, a full list of whom can be found here: http://halogen.le.ac.uk/partners/



Finally we would like to gratefully acknowledge the effort and expertise of all project team members who have helped us to meet our goals and deliver this exciting project as envisioned in our original funding bid:

Top Row – left to right: Anthony Gibson, Senior Application Support Analyst (Business Objects), Dr Olly Butters, Developer and Database Analyst, Dr Jon Wakelin, Research Computing Services Architect, Liam Gretton, Research Computing Services Architect.

Bottom Row – left to right: David Carter, Project Manager, Mark Widdowson, Senior Database Analyst, Dr Andrew Bradley, GIS Specialist.

Table of Contents for Project Blog
Project specific information can be found using the links below:
Project Planning
http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-1-of-7-aims.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-2-of-7-wider-benefits.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-3-of-7-risk-analysis.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-4-of-7-ipr-licensing.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-5-of-7-project-team.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-6-of-7-projected.html

http://leicesterhalogen2.blogspot.com/2011/03/project-plan-post-7-of-7-budget.html

Project Progress Reports
http://leicesterhalogen2.blogspot.com/2011/03/small-wins-fails-progress-update-for.html

http://leicesterhalogen2.blogspot.com/2011/04/project-update-for-march-2011.html

http://leicesterhalogen2.blogspot.com/2011/05/progress-update-for-april.html

http://leicesterhalogen2.blogspot.com/2011/07/progress-update-for-june.html

http://leicesterhalogen2.blogspot.com/2011/08/progress-update-for-august.html

General – Technical Issues, User Feedback and Issues

User excitement at demonstration: http://leicesterhalogen2.blogspot.com/2011_04_01_archive.html
ArcGIS data exchange issues:
http://leicesterhalogen2.blogspot.com/2011/04/project-evaluation-arcgis-data-exchange.html
Open access debate – views & thoughts.
http://leicesterhalogen2.blogspot.com/2011/06/rcuk-and-hefce-announcement-to-support.html
MySQL & ArcGIS.
http://leicesterhalogen2.blogspot.com/2011/08/mysql-server-and-arcgis.html

http://leicesterhalogen2.blogspot.com/2011/10/arcgis-93-mysql.html

Plotting aggregated points over Google Earth satellite images.
http://leicesterhalogen2.blogspot.com/2011/08/plotting-aggregated-points-over-google.html

Friday 28 October 2011

ArcGIS 9.3 -> MySQL

At the core of HALOGEN sits a MySQL database that stores several different geospatial data sets. Each data set is generally made up of several tables and has a coordinate for each data point. Now most of the geo-folk here like to use ArcGIS to do their analysis and since we have it (v9.3) installed system-wide I thought I would plug it into our database. Simple.

As it happens the two don’t like to play nicely at all.

To get the ball rolling I installed the MySQL ODBC driver so they could communicate. That worked pretty well, with ArcGIS being able to see the database and the tables in it. However, trying to do anything with the data was close to impossible. Taking the simplest data set, which consisted of one table, I could not get it to plot as a map. The problem was the way ArcGIS was interpreting the data types from MySQL; each and every one was being interpreted as a text field. This meant that it couldn’t use the coordinates to plot the data. I would have thought that the ODBC driver would have given ArcGIS something it could understand, but I guess not. The workaround I used for this was to change the data types at the database level to INTs (they were stored as MEDIUMINTs on account of being BNG coordinates). I know this is overkill, and a poor use of storage etc., but as a first attempt at a fix it worked.

Then I moved on to the more complex data sets made up of several tables with rather complex JOINs needed to properly describe the data. This posed a new problem, since I couldn’t work out how to JOIN the data ArcGIS side to a satisfactory level. So the solution I implemented here was to create a VIEW in the database that fully denormalized the data set. This gave ArcGIS all the data it needed in one table (well, not a real table, but you get the idea).

If we take a step back and look at the two ‘fixes’ so far, you can see that they can be easily combined into one ‘fix’. By recasting the different integers in the original data in the VIEW, I can keep the data types I want in the source data and make ArcGIS think it is seeing what it wants.
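To make the combined ‘fix’ a bit more concrete, here is a small self-contained sketch of a denormalising VIEW that also recasts the coordinate columns. It runs against an in-memory SQLite database through Python purely so that it can be tried anywhere; it is not the HALOGEN schema, and the table and column names (place, element, place_flat) are made up for illustration. SQLite does not distinguish integer widths, so the CAST here only shows where the recast would go; MySQL spells an integer cast as CAST(x AS SIGNED), and whether ArcGIS is happy with the type that produces would need checking.

```python
import sqlite3


def build_flat_view(conn):
    """Create two hypothetical source tables and one flattened, recast view."""
    conn.executescript("""
        -- Source tables keep the compact types we want to store.
        CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT,
                            easting MEDIUMINT, northing MEDIUMINT);
        CREATE TABLE element (place_id INTEGER, language TEXT);
        INSERT INTO place VALUES (1, 'Leicester', 458500, 304500);
        INSERT INTO element VALUES (1, 'Old English');

        -- The view does the JOIN and the recast in one place.
        CREATE VIEW place_flat AS
        SELECT p.id,
               p.name,
               CAST(p.easting AS INTEGER)  AS easting,
               CAST(p.northing AS INTEGER) AS northing,
               e.language
        FROM place p
        JOIN element e ON e.place_id = p.id;
    """)


conn = sqlite3.connect(":memory:")
build_flat_view(conn)
rows = conn.execute(
    "SELECT name, easting, northing, language FROM place_flat"
).fetchall()
```

The source tables are left untouched; only the view presents the wider type and the single flat ‘table’ to the client.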

And then in steps the final little annoyance that got in my way. ArcGIS likes to have an index on the source data. When you create a VIEW there is no index information cascaded through, so again ArcGIS just croaks and you can’t do anything with the data. The rather ugly hack I made to fix this (and if anyone has a better idea I will be glad to hear it) was to create a new table that has the same data types as those presented by the VIEW and do an

INSERT INTO new_table SELECT * FROM the_view

That leaves me with a fully denormalised real table with data types that ArcGIS can understand, albeit at the price of having a lot of duplicate data hanging around.

Ultimately, if I can’t find a better solution, I will probably have a trigger of some description that copies the data into the new real table when the source data is edited. This would give the researchers real-time access to the most up-to-date data as it is updated by others. Let’s face it, it’s a million times better than the many different Excel spreadsheets that were floating around campus!

Wednesday 31 August 2011

Progress Update for August

The team continue to make good progress.

Data Sources
All the new data sources originally identified have been added to the HALOGEN database.

Tools Development and Evaluation

Development of prototypes of the ‘data extraction tool’ (using Business Objects) and a ‘web based enquiry tool’ continues.

The second iteration of the ‘web based enquiry’ development was presented to Nottingham users in late June and to the Roots of the British/Diaspora researchers at Leicester in July. Feedback from these sessions has been used to enhance its presentation and functionality.

A prototype of the Business Objects based data extraction tool has been delivered to researchers at Leicester and they are evaluating its use.

Other

A visit from David Flanders (JISC Programme Manager) on 2nd August highlighted a number of hot topics (most notably data licensing issues and ideas on how to test our deliverables). The feedback from this session was very positive.

Photo below - left to right: Olly Butters, Andrew Bradley, Jonathan Tedds and Dave Carter.


Monday 8 August 2011

MySQL server and ArcGIS

Olly and Liam got to grips with linking ArcGIS and MySQL server last week. Essentially, they have created a method to allow ArcGIS to talk directly to the HALOGEN data without the need to export data (e.g. as a CSV or tab-delimited text file). So far it looks like we can query directly on the database. Why is this great news, you ask? Previously I created an events theme in ArcGIS which was then converted to a shapefile. Unfortunately, with large volumes of data we exceeded the maximum size of shapefiles and therefore could not query the data any further without a crash. We still need to test and see if relational databases work though, watch this space...

Andrew.

Tuesday 2 August 2011

Plotting aggregated points over Google Earth satellite images

Olly has created a web interface that plots our data on to Google Earth satellite images. Our data is aggregated to the centre of BNG 1km squares to preserve confidentiality and to standardise the resolution of our database as there are several different data sources to compare. I am concerned that end users may forget / not read the project documentation and think that a point marks the exact location of data when in reality it could be anywhere in a km square around the point.  Does anyone else share these concerns - or have a way of reminding an end user of this?
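To spell out what the aggregation does, here is a minimal sketch in Python, assuming eastings and northings are whole metres on the British National Grid. The function names are made up for this example and this is not the code used in the web interface. The second function returns the corners of the square, which is one possible way of reminding users of the true resolution: draw the whole 1km square rather than a single point.

```python
def snap_to_km_centre(easting, northing):
    """Return the centre (in metres) of the 1 km square containing the point."""
    return (easting // 1000) * 1000 + 500, (northing // 1000) * 1000 + 500


def km_square_bounds(easting, northing):
    """Return (min_e, min_n, max_e, max_n) of the 1 km square containing the point."""
    min_e = (easting // 1000) * 1000
    min_n = (northing // 1000) * 1000
    return min_e, min_n, min_e + 1000, min_n + 1000


centre = snap_to_km_centre(458723, 304391)
bounds = km_square_bounds(458723, 304391)
```

Every point inside the same square maps to the same centre, which is exactly why a plotted point should be read as ‘somewhere in this square’ and not as an exact location.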

Andrew.

Tuesday 5 July 2011

Progress Update for June

Good progress is being made and we have now delivered the new data sources as planned!

Data Sources

A key target was to complete the load of the new data sources to HALOGEN during June and this has been achieved. Well done Olly and Andrew !

The coverage of the PAS data has been increased to all of England. The Capelli data and 1881 Surname census data have now been added to the database.

The load of an additional source of surname data has been requested and will be added over the next few months.

Tools Development and Evaluation

Development of prototypes of the ‘data extraction tool’ (using Business Objects) and a ‘web based enquiry tool’ continues.

A plan for the iterative development of the ‘web based’ enquiry has been agreed. The second iteration of development is now complete and the system was presented to users at Nottingham University's Institute for Name Studies on 28th June.


Other

A communications plan to support internal dissemination activity relating to the project has been drafted for Board approval.

We had our first Project Board meeting on 27th May and that went well.

Dave Flanders, our JISC Programme Manager has scheduled a visit for 2nd August.

Tuesday 7 June 2011

RCUK and HEFCE announcement to support Open Access


A majority of researchers would probably agree that Open Access is a positive way to disseminate research and reach a wider audience, and the wider audience would probably appreciate their right to view research once barriers are lifted. Removing the cost or avoiding the sale of a product has been shown to increase accessibility. Take the aerial shots on Google Earth, for example: how many home computer users have not snatched a look at their back yard or favourite holiday destination? The process is the same for documentation.
We sat down and discussed the number of ways that articles are being made and sourced as Open Access.  Our different backgrounds unearthed that we are often unaware of particular systems and procedures in disciplines outside our own fields and operations in other institutions. This serves to illustrate how complicated and mushrooming the idea of Open Access is, and how limited forms of Open Access have existed for a number of years.  Some kind of structuring and linkage between Open Access concepts appears a good way forward so the news of support from RCUK and HEFCE is welcome.
There are slight concerns about the clash between the peer review and so-called pay-to-publish options (we realise these are the extremes and some hybrid versions are in place), as these can compromise and conflict with mounting pressures in the academic world. Academics are aiming for journals with high impact (to meet REF needs), the ones outside the public domain that often come with a subscription. The REF pressures far outweigh the need to dose every man on the street with detailed research findings. On the other hand, allowing Open Access to research and project documentation is an alternative opportunity to champion and publicise the achievements to similar and interested academic audiences, whilst the general public can cast their eye over our achievements if they choose. Which system should we allow ourselves to gravitate to?
How does this influence the HALOGEN group?  Much of our documentation is ‘white paper’ information on what and how we do things, information that we are willing (and proud!) to present to an open audience which also serves as a publicity agent and in effect enhances the purpose of our work.  In fact much of our documentation is (or will be) available, we are encouraged to contribute to our University repository and we have our HALOGEN project website.  The website is a source of information at different levels; short summaries for those with a passing interest and then the links and downloads provide detail to the audience who need to be more interactive or choose to know more depth in what we do.  There is an opportunity with Open Access to contribute to a bigger more widely accessible repository – but guidance is essential and we wait to hear the outcomes of this recent announcement from RCUK and HEFCE.
Andrew, Olly and Dave.

Wednesday 11 May 2011

Progress Update for April

Good news - the resource issue that has held up progress in March and April has been resolved.
Olly Butters has joined the team and made an immediate difference.

Progress is summarised below:

Data Sources
 
The new Capelli data source is now loaded into the database and being tested.

The 1881 Surname Census data has been reviewed and work to calculate parish centroids is well advanced. Once we have parish centroid data then the information can be added to the database and tested.

Tools for Researchers

The requirements for the 'web based query tool' have been reviewed and agreed with Jayne Carroll of Nottingham University's Institute for Name Studies (the owners of the Key to English Place Names data used by HALOGEN).

Work on developing prototypes has started and the target is to demonstrate the first cut products to the project team and key research users in late May/early June.

The first Project Board meeting is now scheduled for 27th May. Onwards !

Friday 8 April 2011

Excitement at the GIS tutorial

Last week I gave a GIS tutorial using some prototype data we cleaned up from the original KEPN, PAS and GUL data sets to Turi, Jayne, Phillip, Dave and Mark W. The object was to help empower these blokes by showing them how to load up the data into a GIS environment and chop up the data with some simple querying methods thus stimulating the construction of new research questions.

Talk about the wow factor - they were really chuffed to see a spatial plot of what they used to know as rows and rows of tabulated data. After filtering the data, suddenly their hypotheses were mapped out in front of them, e.g. place names with Cornish elements did gravitate to the county of Cornwall, place names with Norse language elements did gravitate to the North and East of England. When ancillary data such as roads and rivers were plotted as background layers I think the cogs and wheels started spinning and ways to answer research questions were suddenly looking so much easier for the researchers.

The data was questioned though, and quite rightly so: it should be a standard procedure for any researcher to be sure of the origins and quality of their data.

(i) Some of the grid references were slipping through our padding procedure and looking too accurate (by this I mean our rounding up of grid references to 0.5km). We did this to ensure privacy of data and maintain a consistent resolution between datasets. This is a small technical issue we need to address.
(ii) Cornish place name elements were detected in Herefordshire and way up in Lancashire. Dave and I examined the original data source a few days later and found that these results were true. It was the original data that was throwing up the anomalies; technically the HALOGEN team appeared to get things right.
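Point (i) is the sort of thing that can be checked automatically before data goes out. The sketch below is only an illustration and makes an assumption: that a correctly padded coordinate sits at the centre of a 1km square, so both easting and northing end in 500 metres. The row layout (label, easting, northing) and the function name are hypothetical.

```python
def unpadded_rows(rows):
    """Return rows whose coordinates are not at the centre of a 1 km square.

    Each row is assumed to be (label, easting, northing) in whole metres.
    """
    return [
        row for row in rows
        if row[1] % 1000 != 500 or row[2] % 1000 != 500
    ]


sample = [
    ("padded correctly", 458500, 304500),
    ("slipped through", 458723, 304500),
]
suspects = unpadded_rows(sample)
```

Anything returned by the check looks ‘too accurate’ and should be passed back through the padding step before release.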

What do we learn from this? Firstly all the hard work is paying off and the researchers find this a really useful tool. Secondly we can only deal with the data we receive. We did our own quality check to be sure we had it right. If the source data is wrong, HALOGEN cannot 'make up' data that fits; a strategy of quality control on the original data is required.

Andrew

Project Evaluation - ArcGIS Data Exchange Issues

We've hit a problem in exchanging data from the HALOGEN databases with ArcGIS. Up until now we've been exporting the data from the databases into files as tab separated fields, importing into ArcGIS and converting into shapefile data from there. However ArcGIS's shapefile format (or more specifically its associated .dbf database file) has some severe limitations which mean we can't realistically continue down this route.

ESRI has this to say about the use of shapefiles:

With some exceptions that are noted below, shapefiles are acceptable for storing simple feature geometry. However, shapefiles have serious problems with attributes. For example, they cannot store null values, they round up numbers, they have poor support for Unicode character strings, they do not allow field names longer than 10 characters, and they cannot store both a date and time in a field. These are just the main issues. Additionally, they do not support capabilities found in geodatabases such as domains and subtypes. So unless you have very simple attributes and no geodatabase capabilities, do not use shapefiles.

(see Geoprocessing Considerations for Shapefile Output).

Of these limitations, I believe the inability to represent NULL values is the most serious - NULLs are represented as zero within the shapefile's .dbf format, and this means that any value of zero within the data cannot be trusted at all.
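To show why this matters, here is a tiny Python sketch that mimics the behaviour described above. It does not read or write real shapefiles; to_dbf_value is a made-up stand-in for the export step, and the numbers are invented.

```python
def to_dbf_value(value):
    """Mimic the export described above: a missing value is written as 0."""
    return 0 if value is None else value


def zeros_are_trustworthy(values):
    """A zero after export can only be trusted if nothing was missing before it."""
    return all(value is not None for value in values)


original = [12, 0, None]          # a real zero and a missing value
stored = [to_dbf_value(v) for v in original]
```

After the export the genuine zero and the missing value are indistinguishable, so the only safe moment to find out whether zeros can be trusted is before the data leaves the database.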

I am looking at the ArcGIS Data Interoperability extension so that we can plug ArcGIS directly into the HALOGEN database and avoid these problems. Unfortunately there's no fully-featured solution for connecting ArcGIS into MySQL, so for evaluating this I'll be porting the HALOGEN databases to PostgreSQL.

Liam

Thursday 7 April 2011

Project Update for March 2011

In terms of project progress we have achieved less than we planned in March. This is due to one of the key development team members not being available due to a backlog of support work and the need to support another high priority project for researchers at the University (we are developing a hosting service for researchers).

The work on loading new data sources and addressing known problems with ArcGIS and the Key to English Place Names data has all been impacted.

That said the good news is that to address this problem (hopefully for the life of the project !), and to catch up lost time, the opportunity to recruit additional temporary resource has been taken and Olly Butters joined the team on 4th April for 6 months.

Olly has been working on Physics and Astronomy projects and has good MySQL/PHP skills and a lot of experience of curating multiple/large data sets.


Key achievements include:

The 1881 Surname/Census data has been obtained and is being reviewed.

A discussion paper covering the tools evaluation has been issued for review.

A training workshop has been held for members of the Roots of the British research collaboration to allow them to get to grips with ArcGIS and HALOGEN data.

Requirements for the 'web enquiry' have been agreed and documented with the research user community.

Wednesday 9 March 2011

"Small WIN(s)" & "FAIL(s)" - Progress Update for February 2011

February Update & Issues

The project has now formally started and an internal project proposal has been approved.

Resources have been allocated and the contract of our GIS specialist has been extended to cover the specialist GIS resource required to support this project.

Good progress has been made on the Data Sources Work Package (WP2). The HALOGEN system has been extended to include a full download of Portable Antiquities Scheme (PAS) data for all English counties. The increase in the volume of PAS data has created problems with the ArcGIS tool used by the team (see issue below).

The project team have changed the scope of the Data Sources work slightly and agreed that 2 new sources of data will be added. These will be additional Genetics data from the published datasets of Capelli and the 1881 Surname/Census data from Kevin Schurer’s work with the Essex Data Archive at the University of Essex.

This latter source is a replacement for the surname distribution data from the Archer Surname Atlas originally referenced in the project proposal. Initial investigations identified issues with the use of this data which have been addressed by the Essex Data Archive.

Requirements relating to the Capelli data have been agreed and the source data obtained. The data has been enhanced by the addition of relevant spatial data. Design work to include this in the HALOGEN database is in progress.

An initial meeting has been held to review the format and content of the surname census data from the Essex Data Archive and further sessions are planned.

Extension of PAS Data – Impact on ArcGIS


The addition of extra PAS data has caused problems with the use of shape files in ArcGIS. To temporarily overcome this issue PAS data is being extracted in 3 separate files for use with ArcGIS. The team continue to investigate a permanent fix to the problem.
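As an illustration of the workaround, a minimal Python sketch of splitting one large extract into several smaller files is shown below. This is not our actual extraction procedure: the field names (find_id, county), the file naming and the choice to split by row count are all assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Dict, List, Sequence

Row = Dict[str, object]


def split_rows(rows: Sequence[Row], parts: int) -> List[List[Row]]:
    """Split rows into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(rows), parts)
    chunks: List[List[Row]] = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(rows[start:start + size]))
        start += size
    return chunks


def write_chunks(chunks: List[List[Row]], out_dir: Path, stem: str) -> List[Path]:
    """Write each chunk to its own CSV file (hypothetical naming: stem_1.csv, ...)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for n, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}_{n}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            if chunk:
                writer = csv.DictWriter(fh, fieldnames=list(chunk[0].keys()))
                writer.writeheader()
                writer.writerows(chunk)
        paths.append(path)
    return paths


# Made-up records standing in for a PAS extract.
records = [{"find_id": i, "county": "Leicestershire"} for i in range(10)]
chunks = split_rows(records, 3)
```

Calling write_chunks(chunks, Path("pas_extract"), "pas") would then produce three CSV files that can be loaded separately.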

Any hints or tips welcome!

Project Plan Post 2 of 7: Wider Benefits to Sector & Achievements for Host Institution

The benefits from specific project deliverables are listed in the table below.


1. Enhanced Data Sources

Deliverable:
- Extension of existing PAS datasets to provide national coverage. This is a relatively simple exercise as the Key to English Place Names and Genetics data is already present in the database at national level but some additional cleaning is required. A new ‘national’ extract of PAS data would be needed.
- Addition of 2 new data sources relating to the geographical distribution of surnames and further genetics data to the database.

Benefits and Outcomes:
- A review of existing HALOGEN data extraction, cleaning and load procedures to cope with ingestion of ‘national’/larger data sources.
- A prioritised list of data source related requirements that can be used to guide the future development of the service.
- Data extraction, cleaning and load procedures for new datasets.
- Updated data glossary for researchers.

2. A Revised Data Management Plan

Deliverable:
- Additional requirements, policies and practice recommendations covering new features will be documented.

Benefits and Outcomes:
- An assessment of the effectiveness of the DCC’s draft DMP as an aid to research data management.
- Information and lessons learned to source a JISC case study and input to community synthesis project.

3. Evaluation and Selection of Data Extraction Tool

Deliverable:
- Requirements for an appropriate tool will be documented. These will initially be used to assess the feasibility of using the existing tools supported by IT Services. If this is not appropriate a market evaluation will take place.

Benefits and Outcomes:
- A contribution to wider JISC community to help develop awareness of good practice in terms of the selection and availability of similar tools.

4. An Implementation Plan for the Data Extraction Tool

Benefits and Outcomes:
- An implementation plan covering the timescales, costs, risks and issues relating to the deployment of the selected tools will be documented. If feasible within the 9-month project window then the preferred product will be implemented.

5. Feasibility Study for the Provision of HALOGEN Database Enquiry Facilities for Institute of Place Names Website Users

Benefits and Outcomes:
- A contribution to wider JISC community to help develop awareness of good practice in how to deal with similar requirements and problems.

6. Interim project reports

Benefits and Outcomes:
- Compliance with JISC requirements for project control.

7. A project blog and updated Halogen Project website

Benefits and Outcomes:
- Sector-wide dissemination of findings and engagement with key stakeholder communities.

Project Plan Post 7 of 7: Budget

Budget Summary

The total cost of the project was c.£353,000, of which the JISC award of £85,000 represents 24%. A summary of the total project budget (covering both JISC and institutional contributions) is given below.

Category                      %

Directly Incurred Staff       5
Directly Incurred Other       7
Directly Allocated            50
Indirect                      38

Total                         100

The largest forecast costs relate to staffing. It is estimated that a 'virtual team' of 3.27 FTEs will work on the project for its duration.


Budget Management

The project manager will be responsible for managing and monitoring the project budget on a day to day basis. The project manager will be accountable to the Project Board and will report any significant variances to the Board for discussion and authorisation.

The project board is chaired by Professor Annette Cashmore (Sub-Dean for Medicine and Biological Sciences, Director of CETL (GENIE)) and its membership includes David Flanders, JISC Programme Manager, Professor Mark Jobling (Department of Genetics), Dr Jayne Carroll (Director Institute for Name Studies, University of Nottingham), Mary Visser (Director of IT Services) and Dr Nick Tate (Senior Lecturer, Department of Geography).

Project Plan Post 6 of 7: Projected Timeline, Workplan & Overall Project Methodology

Work Plan

The high level workplan is outlined below. Listed against each workpackage are the initials of the principal team member(s) responsible for its delivery.

Months covered (one column per month): 02/11, 03/11, 04/11, 05/11, 06/11, 07/11, 08/11, 09/11, 10/11

WP1 - Set-up and governance (DC)
- Project set-up, induction and PID
- Steering and project group meetings
- JISC Programme level activity and reporting to funder

WP2 – Data sources (AB, OB, LG)
- Extend coverage of PAS data
- Additional data sources - establish detailed requirements and investigate feasibility for 2 new sources.
- Extract, clean, transform and load new data sources
- Update data glossary and system documentation

WP3 – Update HALOGEN data management plan (DC)
- Review and update DMP

WP4 – Data Extraction Tool (AG, MW)
- Establish and document requirements
- Evaluate tools and select preferred supplier
- Procure and produce implementation plan for preferred tool/supplier

WP5 – Web Enquiry Facility (OB)
- Establish requirements
- Investigate options and feasibility of delivery
- Document and publish findings

WP6 – Develop & Disseminate HALOGEN Case Study (DC, OB, AB)
- Prepare case study re: community syntheses input
- Disseminate to stakeholders

WP7 - Project Evaluation (DC)
- Develop interim and final project reports (see evaluation milestones below)

WP8 – Dissemination (DC, OB, AB)
- Maintain project blog, halogen web site and run briefing sessions

Project Approach

When identifying and evaluating specific tools and technologies, open source solutions will be considered alongside those which are already licensed by the University, the aim being to reduce the ‘barrier to entry’ for other institutions wishing to adopt the approaches used by UoL.

The evaluation will involve some desk based research but will be heavily biased towards the building of 'prototypes' using different tools.