The Projects initiative, a Digital Science endeavour, provides a desktop app that lets you comprehensively organise and manage the data you produce as your research projects progress. The rationale behind Projects is that scientific data needs to be properly managed and preserved if we want it to last: there is a worrisome trend whereby the amount of research data generated grows by 30% every year, and yet a massive 80% of scientific data is lost within two decades.
Projects and the open science data sharing platform figshare published an impressive and rather telling infographic on science data preservation and its chronic mismanagement [scroll down to see it]. What struck me looking at these numbers is neither the high-throughput data production nor the overall funds it requires – 1.5 trillion USD spent on R&D! – but how little information there is on public policies aimed at solving the problem.
It would be a mistake to think that access to the research paper is enough. A publication is a summary, a scholarly advertisement of sorts, and can in no way substitute, on its own, for the raw data, protocols, experimental details and – where applicable – the software source code used to run the analysis. Yet, while an ever-increasing number of journals open up scientific publications, researchers and their institutions trail behind when it comes to sharing science data. Such laziness is not harmless: the infographic highlights that 80% (!) of datasets over 20 years old are no longer available.
Such a staggering figure is still just the tip of the iceberg: every time we produce data, we also generate metadata (“data about data”) and protocols (descriptions of methods, analyses and conclusions). Guess what: as files quickly pile up and are mismanaged, all of that falls into oblivion.
This also means that the data we produce today is not accessible to the broader research community. A large proportion of experiments yield negative or neutral results, failing to confirm the working hypotheses. This is an issue on two counts. First, we waste our time, energy and brainpower redoing what colleagues have already tried and found not to work. But since data is not shared, we joyfully dive into writing grants asking for money to eventually produce data that will never end up in a paper… as publications today only account for ‘positive’ results (i.e., those supporting the working hypotheses).
The second issue with withholding data is the impossibility of repeating, or even statistically verifying, a published study. This has a name: reproducible research. We have all heard about the shocking outcome of Glenn Begley’s survey of 53 landmark cancer research publications (hint: only six of them could be independently reproduced). The infographic below paints a slightly different yet equally frightening picture: 54% of the resources used across 238 published studies could not be identified, making verification impossible. Sticking the knife in deeper, the infographic also highlights that the number of retractions due to error and fraud has grown fivefold since 1990. This complements another estimate showing that the number of retracted papers has grown tenfold since 2000.
We need public policies to come to the rescue. Funding bodies and various other institutions are starting to demand improved data management, the infographic notes, citing the “Declaration on Access to Research Data from Public Funding” and the NIH, MRC and Wellcome Trust (which now require that data management plans be part of grant applications). The EU has also committed to treating data produced in publicly funded studies as ‘public data’, aligning its sharing with other public sector data in a broader Open Data move. The European Commission has accordingly launched a Pilot on Open Research Data in Horizon 2020.
P.S. And in case you need additional incentives for data sharing, have a read.
P.P.S. From what I’ve heard, the people at Projects are interested in hearing your views on data availability and how you manage your own data, so get in touch on Twitter @projects.