7 What Data To Publish
Data Sharing policies and privacy protection considerations
In general, research data should be shared in as much detail as possible, to maximize its utility to the scientific community. When dealing with human data, however, considerations of consent and privacy are paramount.
When deciding what data should or should not be deposited publicly, consult the consent agreements signed by the donors of the tissue samples used in your study. When these samples come from a tissue bank, consult the tissue bank, which should be able to provide details of how participants were consented. There will also generally be guidance on what data the tissue bank is able to share with you as a researcher and which portions of that data should not be made generally available. Many of the tissue samples used by HDBI researchers are from the Human Developmental Biology Resource (HDBR). For additional details on HDBR policy about the sharing of data relating to their samples, see the HDBR’s data sharing policy.
Data which pertain to technical or biological properties of the sample are generally suitable to be shared. Data about the donor of the sample will generally not be suitable to be shared. What can be shared depends on the specific consent agreements associated with a given sample and should not be assumed.
Data points like sample collection dates can be problematic. Where the time between two points in a sample’s history is relevant, e.g. the time from collection to processing, it is better expressed as a relative interval than as absolute dates.
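As an illustration of expressing sample history in relative time, the following is a minimal sketch that replaces absolute dates in a metadata record with day offsets from the collection date. The field names (`collection_date`, `processing_date`) and the function `to_relative_days` are hypothetical and not taken from any HDBI or HDBR schema; the sketch also assumes the values are plain `date` objects.

```python
from datetime import date


def to_relative_days(record, anchor_key="collection_date"):
    """Return a copy of `record` with absolute dates replaced by day offsets.

    The anchor date itself is dropped, so no absolute date remains in the
    output. Field names here are illustrative assumptions only.
    """
    anchor = record[anchor_key]
    relative = {}
    for key, value in record.items():
        if isinstance(value, date):
            if key == anchor_key:
                continue  # never publish the anchor date itself
            # e.g. "processing_date" -> "processing_days_after_collection"
            stem = key[:-5] if key.endswith("_date") else key
            relative[stem + "_days_after_collection"] = (value - anchor).days
        else:
            relative[key] = value  # non-date fields pass through unchanged
    return relative


sample = {
    "sample_id": "S001",
    "collection_date": date(2021, 3, 4),
    "processing_date": date(2021, 3, 6),
}
shareable = to_relative_days(sample)
print(shareable)
```

The resulting record keeps the interval between collection and processing while removing the dates that could be cross-referenced against other records. Whether the remaining fields are suitable to share still depends on the relevant consent agreements.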
7.1 Additional Ethical & Legal Considerations
Context to be aware of when making data-sharing policy decisions
Where samples have been donated under the condition that any published data be pseudonymised, care should be taken to minimize re-identification risk, i.e. the risk that the identity of the donor could be re-associated with the data. One of the challenges in this space is that any genetic data now carries inherent re-identification risk.
Re-identification attacks are generally carried out by joining together disparate pieces of information which are not separately identifying but which, when combined, can uniquely or probabilistically identify an individual. These attacks can take forms which are difficult to anticipate.
For example, cross-referencing a list of names and clinic appointment times with sample collection dates could substantially narrow the field of possible donors of a given sample, potentially even uniquely identifying the donor. This only requires two mappings: name to appointment date, and sample to collection date. If genetic information is tied to the sample collection date and there is still more than one possible donor, the genetic information could be used to attempt to narrow down identities further. Genetic data can be used to estimate ancestry and ethnic origin, which could probabilistically re-identify someone with a name of characteristic ethnic or geographic origin. Even sequencing data which is not primarily intended for genotyping, such as RNA-seq, can be used for this purpose. For example, Quinn et al. compare SNP (Single Nucleotide Polymorphism) calling methods using RNA-seq data (Quinn et al. 2013 [cito:citesAsAuthority]).
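The two-mapping linkage described above can be sketched in a few lines, which is part of why it is a concern. The following is a toy illustration using entirely made-up data; the names, dates, and the function `candidate_donors` are hypothetical. It shows how joining appointment records to sample collection dates reduces each sample to a small candidate set.

```python
from collections import defaultdict
from datetime import date

# Mapping 1 (made-up): name -> clinic appointment date
appointments = [
    ("Person A", date(2021, 3, 4)),
    ("Person B", date(2021, 3, 4)),
    ("Person C", date(2021, 3, 5)),
]

# Mapping 2 (made-up): sample -> collection date, as might appear in metadata
samples = [
    ("S001", date(2021, 3, 4)),
    ("S002", date(2021, 3, 5)),
]


def candidate_donors(appointments, samples):
    """Join the two mappings on date; return possible donors per sample."""
    by_date = defaultdict(set)
    for name, day in appointments:
        by_date[day].add(name)
    return {
        sample_id: sorted(by_date.get(day, set()))
        for sample_id, day in samples
    }


candidates = candidate_donors(appointments, samples)
for sample_id, names in candidates.items():
    print(sample_id, len(names), names)
```

In this toy data, one sample is narrowed to two candidates and the other to a single person, without any genetic information being used. A candidate set of size one corresponds to unique re-identification; larger sets are where additional attributes, such as inferred ancestry, could be used to narrow the field further.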
Unfortunately, there are numerous examples of cybercriminals using open source intelligence (OSINT) techniques and leaked medical data to target and exploit victims. Whilst not yet a popular attack vector with cybercriminals, tying an individual’s identity to their (or their close relatives’) genetic profile(s) is a potential avenue of attack. Related methods have already been used by law enforcement to catch criminals, a topic which has entered the public discourse, with prominent science communication content creators like Veritasium covering the subject.
In the EU, the sharing of genetic data is governed by the GDPR. This also largely applies to the UK post-Brexit, but attention should be paid to any divergences of the UK GDPR from the EU GDPR. Genetic data is defined in the EU GDPR by Recital 34 (see also Article 4, definition 13), and its processing is permitted under Article 9 for specified purposes where the data subject has given consent (see Article 7 and Article 4, definition 11) for said processing. The Public Health Genomics Foundation has produced an extensive report on GDPR and genomics data which aids significantly with the interpretation of this legislation in the context of genomic data. Genetic data in particular presents challenges around who the data subject is, as your genetic data is also, in part, your relatives’ genetic data.
A talk by Prof. Dr. Fabian Prasser of the Berlin Institute of Health gives a practical look at health data anonymisation under GDPR, with an example.
In the USA, the Genetic Information Nondiscrimination Act (GINA) 2008 and the Health Insurance Portability and Accountability Act (HIPAA) 1996 are the primary pieces of legislation governing genomic data sharing. The talk "De-identification Standards: What Works, What Doesn’t, and What Fails Miserably" by Bradley Malin of Vanderbilt University provides an interesting comparison of the EU and US approaches to this question.
Genetic data sharing in the consumer space is currently something of a regulatory ‘wild west’, with consumers able to ‘consent’, through accepting end user licence agreements (EULAs), to genetic data sharing practices that might be prohibited or closely scrutinized for public entities. These agreements are unilaterally modifiable by the entity collecting the data, and it is widely accepted that almost no one reads them.
This has implications for research data sharing, as databases of direct-to-consumer genetic testing results can potentially be used in attempts to re-identify data subjects. Secondary genetic analysis platforms, which take data generated by direct-to-consumer genetic test providers and provide additional analysis results, are also becoming more prevalent. These include sites with significant repositories of user-submitted data, often for the purposes of performing genealogical analysis. These developments represent a lowering of the technical barriers to potential abuses of this data for purposes such as re-identification, a fact that is relevant when interpreting whether or not data can be considered anonymous under GDPR Recital 26.
HDBI Research Tissues & Ethics page
The Global Alliance for Genomics & Health (GA4GH), an independent, non-governmental, not-for-profit, international association, provides a number of toolkits with useful resources for those working with genomic data.