Wednesday, 30 November 2011

JISC GEO Event – Break Out Session – Repurposing Geospatial Data

A small but passionate group (Shawn Day – Royal Irish Academy, Andrew Bradley – University of Leicester and Dave Carter – University of Leicester) met at Table 3 and discussed this topic.

All parties were advocates for, and very much in favour of, the reuse/repurposing of research data. There are big opportunities for efficiencies and cost savings in terms of data creation/collection, curation and management.

That said, researchers should not assume that repurposing is necessarily ‘straightforward’ or ‘easy’ at present. Researchers need to be mindful of a number of issues.

A number of points were discussed – the order below does not suggest a hierarchy or priority order.

How do you find what’s available? – In England there are some national, subject-specific and institutional repositories for data but it is not always easy to find out what is available and from where. In Ireland there is a real absence of this type of repository – where do you go to get data?

Permissions/Licenses – You need to make sure that you can use the data you have found for your specific purpose. Depending on the type of license, this could mean finding and checking with the ‘real’ data owners.

Ethical issues – In the case of the HALOGEN project we repurposed/reused genetic data. There were sensitivities around this as the personal data had been collected for a specific purpose and not for the purpose we had in mind. We had to apply for/receive approval from the University of Leicester’s Ethics Committee. Is it possible that we were looked on favourably as we were employees of the University? Politics potentially plays a part.

Standards – Specifically relating to geospatial shapefile formats, there is a strong bias towards commercial standards, specifically ESRI. There are ‘open’ alternatives but are they really widely used?

Can you ‘understand’ the structure and assumptions of the data you have obtained? – There are many factors that help/hinder this. As a researcher, if you want to have confidence in your results then you must be confident you understand the data on which they are based. If you’re repurposing data that logic still applies. You will need to have access to good quality documentation in terms of data glossaries, data models and data dictionaries.

Metadata standards are crucial.

Ideally you want to be able to talk ideas and issues through with the original creator or someone who is close to the data.

For many data sources, access to the above is very limited and the quality of documentation is variable.

You really do need to understand the provenance of the data you are about to use. You need to undertake a risk assessment and think about all the issues up front – if you aren’t comfortable then you don’t use it!

We feel repository managers/data creators have a key role here as to some extent they need to be able to ‘guarantee’ the data they hold. The idea of ‘kite marks’ was discussed.

Are there automated tools that could be used to validate the structure/completeness of data sets (XML checkers were discussed)?
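
As a rough illustration of the kind of check we had in mind, here is a minimal Python sketch using only the standard library. It tests whether a metadata record is well-formed XML and whether a handful of required fields are present and non-empty. The field names (title, creator, licence, provenance) and the function name check_metadata are made up for this example and are not taken from any real metadata standard or repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of fields a repository might insist on (an assumption,
# not taken from any real metadata standard).
REQUIRED_FIELDS = ["title", "creator", "licence", "provenance"]


def check_metadata(xml_text, required=REQUIRED_FIELDS):
    """Return a list of problems found in a metadata record (empty list = none found)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Structure check: the record is not even well-formed XML.
        return [f"not well-formed XML: {err}"]

    problems = []
    for name in required:
        element = root.find(name)  # looks at direct children of the root only
        if element is None:
            problems.append(f"missing element: {name}")
        elif not (element.text or "").strip():
            problems.append(f"empty element: {name}")
    return problems


# Example record with an empty licence and no provenance element.
record = (
    "<dataset><title>Example</title><creator>A. Researcher</creator>"
    "<licence></licence></dataset>"
)
for problem in check_metadata(record):
    print(problem)
```

A check like this only tells you that the expected pieces are there. It says nothing about whether the content is accurate, so it would complement good documentation and a conversation with the data creator rather than replace them.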

Forcing Researchers to Deposit Data – We feel one of the biggest barriers at present is the fact that researchers don’t necessarily have to make data available. Funders/funding bodies have a key role here as they should make the deposit of data in a ‘repository’ mandatory and specify the terms of any associated licenses.

… and from the above point comes our ‘Number 1’ recommendation – Funding bodies need to stipulate up front as part of any award that data derived as a result of the research they sponsor must be made available in a format that supports reuse and repurposing. This data should be deposited in either an institutional or subject-specific repository. Exceptions to this should be allowed but they should be just that – justified exceptions. If each funding body maintained an index of the data sets they had sponsored, one of the key barriers (access - in terms of finding out what is available!) would in part be overcome.

Over to you ………..

1 comment:

  1. Ah, I perked up when I started reading the last section.
    Us on table 5, sharing data (which I suppose has strong overlaps with repurposing), came to the same conclusion for a number 1 recommendation.

    Saying funding bodies should contractually-force projects to release data seems rather sad if it's the only way change will come about. Should we also be supporting the sharing by showing the original projects what innovation is done after they share their data? Perhaps repositories should have a comment-like feature for each set so you could say "This was used in [link to some project/new dataset]".