Interesting new model for data sharing

Mike Schatz has posted an interesting paper on bioRxiv on ‘The next 20 years of genome research’.

In it, he argues that in future ‘it will become less and less practical to transfer data into these [NCBI/EBI] archives as they exist today’ and that ‘In its place, we will see the rise of federated approaches for exchanging biological data, especially computing centers dedicated to large sequencing facilities…The major technical reason this model will become more widespread is that at large scales it is increasingly more efficient to upload code segments, often measured in kilobytes to megabytes, rather than to download entire large collections’.

He points out, of course, that genome data is most powerful when aggregated and combined: ‘Therefore, the resources will have to establish common application programming interfaces (APIs) to enable remote access to their data…although even the most basic of federated tasks, so-called “Beacons” that identify if a resource has any individual with a particular mutation, are proving to be difficult to implement for mostly non-technical reasons’.
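
To make the Beacon idea concrete, here is a minimal sketch in Python of what such a yes/no query might look like. It is only an illustration under stated assumptions: the `Variant`, `Beacon` and `query_beacons` names and the example data are all hypothetical, everything runs in memory, and none of it reflects the real Beacon API or anything in the paper. The point is simply that each resource answers a tiny question locally and only a boolean travels back, never the underlying genomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single mutation, identified by position and alleles (hypothetical schema)."""
    chrom: str
    pos: int
    ref: str
    alt: str


class Beacon:
    """Hypothetical stand-in for one sequencing resource's Beacon endpoint."""

    def __init__(self, name, variants):
        self.name = name
        self._variants = set(variants)  # stays local to the resource

    def has_variant(self, variant):
        # Only a yes/no answer leaves the resource, not the data itself.
        return variant in self._variants


def query_beacons(beacons, variant):
    """Ask every beacon the same question and collect the yes/no answers."""
    return {b.name: b.has_variant(variant) for b in beacons}


# Made-up example data for two imaginary centres.
snp = Variant("chr1", 12345, "A", "G")
centre_a = Beacon("centre_a", [snp, Variant("chr2", 500, "C", "T")])
centre_b = Beacon("centre_b", [Variant("chr3", 42, "G", "A")])

answers = query_beacons([centre_a, centre_b], snp)
```

Here `answers` maps each centre's name to `True` or `False`. In a real federation each `has_variant` call would be a request to a remote service, which is exactly where the agreed API, and the non-technical obstacles, come in.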

This got me thinking about how this relates to microbial genomics. Firstly, we don’t have the same problems of scale of transfer, seeing as our genomes are roughly 1000x smaller than the human genome. Therefore, it is arguable that the main technical reason for this ‘federalisation’ of sequencing repositories fades away. Secondly, standardising APIs across what are essentially competing consortia of sequencing entities would be a major technical challenge. After all, standards are like toothbrushes: everyone has one, but no one wants to use someone else’s.

Thirdly, even if there is agreement on the technical aspects, the non-technical reasons he alludes to (the personality clashes that often seem to occur at the top of fields, as well as different cultures and legal obligations) could put the kibosh on this. I worry that people will use this document as a shield for not sharing data at all (I’m sure this isn’t what Mike is getting at).

BTW, things on bioRxiv are under a CC-BY-NC 4.0 license. According to that license, I’m supposed to link to the license when I use the content.



