The environment for the publication of data is currently changing rapidly. New data journals are emerging, such as Scientific Data from Nature two weeks ago or the Geoscience Data Journal by Wiley. The latter was also the focus of the PREPARDE project, which delivered a nice paper on data peer review a couple of weeks ago (Mayernik et al., 2014). Furthermore, more and more funding agencies require the publication of data, and it is to be expected that this demand will put more pressure on scientists to make their work publicly available.
These developments are great, but at this point I would like to think further into the future. Where should we be in five or ten years, and what is possible in, let's say, 30 or more years? A lot is the answer, but let's go into a little more detail.
I personally hope that in five years the citation of datasets will be the norm rather than the exception. By this I mean the citation of the data itself, not of the journal article that describes the data. I also hope that scientists will actually be required to publish their data after a certain period in which they have the advantage of using it on their own. As a consequence, I think that in ten years scientists will feel severe consequences when they do not do so.
But it is not all about data; another part is the data processing. This could be models (which also deliver data) or the data analysis tools. I personally think that science should move towards enabling a proper publication of programming code. We will need something like model journals (some, like GMD, already exist), and we have to find ways to properly publish the pure code with its documentation. And of course the same is true for models as for data: we need proper quality assurance procedures. The aim should be that all parts of the scientific process are publishable in the same way we publish journal articles today. Open methods, open data, open models. As a consequence, we need procedures to peer review models as well.
The quality evaluation procedure which I have shown in the article might also be a step in that direction. The missing element here is a set of general quality checks with a probabilistic result for model data. Once those exist, the scheme is at least applicable and we can think about the consequences of its results.
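To make the idea of a check with a probabilistic result concrete, here is a minimal sketch: instead of a hard pass/fail, a check could report how plausible a dataset's values are. The function, bounds and variable names below are my own illustrative assumptions, not part of any existing system.

```python
# Illustrative probabilistic range check: report the fraction of values
# that fall inside physically plausible bounds, which can be read as the
# probability that a randomly chosen value is plausible.

def range_check(values, lower, upper):
    """Return P(value within [lower, upper]) for the given sample."""
    if not values:
        raise ValueError("empty dataset")
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside / len(values)

# Example: daily mean temperatures in degrees Celsius, one obvious outlier
temperatures = [12.3, 14.1, 13.7, 95.0, 11.9]
score = range_check(temperatures, lower=-90.0, upper=60.0)
print(f"probabilistic quality score: {score:.2f}")  # prints 0.80
```

A real system would of course need many such checks, tailored to the variable type and its spatial and temporal characteristics.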
In the long term I would prefer a fully automated system which delivers a proper quality report for a given dataset. I think this will be achievable one day, but many steps are necessary in the upcoming decades. To illustrate this I include a little figure here.
Currently the tests/parameters, priors and weights have to be defined by the data author or reviewer, and therefore by a scientist. The data can then be tested, and in a next step the results can be analysed, for example with the help of the quality evaluation algorithm. This analysis would then modify the metadata, and the enhanced dataset can be published. To automate these steps, several important developments are necessary (in the black boxes). The first is the automatic detection and classification of datasets. With good metadata and standardisation this might be possible even today, but without them a lot of development is necessary. The problem is to detect from a dataset not only the type of variable, but also its spatial and temporal variability. The resulting question is how much metadata will be necessary for a good automatic detection.
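The workflow described above could be sketched roughly as follows. Everything here — the data structures, the test tuples, the way results are attached to the metadata, the simple weighted combination standing in for the quality evaluation algorithm — is a hypothetical illustration of the steps, not an existing implementation.

```python
# Hypothetical sketch of the semi-automated workflow: a scientist defines
# tests, priors and weights; the dataset is tested; the results are
# analysed; and the metadata is enhanced before publication.

def run_quality_workflow(dataset, metadata, tests):
    """tests: list of (name, check_function, prior, weight) tuples."""
    results = []
    for name, check, prior, weight in tests:
        score = check(dataset)          # step 1: test the data
        results.append((name, score, prior, weight))

    # step 2: analyse the results; a simple weighted mean stands in
    # for the quality evaluation algorithm
    total_weight = sum(w for _, _, _, w in results)
    overall = sum(s * w for _, s, _, w in results) / total_weight

    # step 3: modify the metadata with the outcome
    enhanced = dict(metadata)
    enhanced["quality_tests"] = [(n, s) for n, s, _, _ in results]
    enhanced["quality_score"] = overall
    return enhanced                     # step 4: ready for publication

# Example with a single, manually defined completeness test
data = [1.0, None, 2.5, 3.1]
completeness = lambda d: sum(v is not None for v in d) / len(d)
out = run_quality_workflow(data, {"title": "toy dataset"},
                           [("completeness", completeness, 0.9, 1.0)])
print(out["quality_score"])  # prints 0.75
```

The interesting automation problems sit outside this sketch: choosing the tests, priors and weights without a scientist in the loop.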
The test specifications, priors and weights could be generated from past experience. Systems are able to learn, so this should be a solvable problem once a good detection has taken place and enough datasets have already been tested for their quality. The quality evaluation algorithm might be developed further, since I do not claim that my version is the best solution to this problem. Nevertheless, the biggest problem is the analysis. How do we interpret the results of a quality evaluation and create a usable automatic addition to the documentation? The available information consists of the test specifications and priors, and theoretically there could be a way to use this information to generate useful additions. Unfortunately I do not expect a solution to this soon, and perhaps the whole mechanism needs a rethink to make it possible.
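Generating priors from past experience could, in the simplest conceivable case, be a Beta-Bernoulli update over historical test outcomes. This is only a sketch under my own assumptions (a uniform starting prior, binary pass/fail history), not a worked-out proposal.

```python
# Sketch: estimate a prior pass-probability for each test from the
# outcomes of previously evaluated datasets, using the posterior mean
# of a Beta(1, 1) prior updated with the observed passes and failures.

def learn_priors(history):
    """history: dict mapping test name -> list of past outcomes (True/False)."""
    priors = {}
    for test, outcomes in history.items():
        passes = sum(outcomes)
        # posterior mean of Beta(1 + passes, 1 + failures)
        priors[test] = (1 + passes) / (2 + len(outcomes))
    return priors

past = {
    "completeness": [True, True, True, False],
    "range_check":  [True, False],
}
print(learn_priors(past))
# completeness -> (1+3)/(2+4) = 0.667, range_check -> (1+1)/(2+2) = 0.5
```

The hard part, as said above, is not this update but interpreting the evaluation results well enough to turn them into useful documentation.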
Whatever the different forms of publication will look like, they might lead to severe changes in scientific ways of working. Besides the additional workload and the opportunity to reuse more of the work already done by others, the pressure might increase. Full transparency would mean that anybody can check what has been done. That might be good for the results, but it will mean more stress for the scientists themselves.
Furthermore, nobody says that the same authors will always be found on a data publication, a data paper, a model paper and the journal paper. This might result in an even higher specialisation of scientists, with positive and negative consequences. First of all, staff who do the more technical work would now get the opportunity for scientific acknowledgement of their work. They might no longer be hidden in 20th place on the journal paper, but may have the lead, or at least a much better place, on the other forms of publication. This of course would only work when all forms are seen as equal, which will take a lot of time. Problematic in this case is that some will use it as an excuse not to get into every detail of their work, since they are only responsible for their own part. Science needs generalists as well as specialists. Nevertheless, further specialisation has been the trend of the past century and has brought a lot of advantages to science. These possible consequences have to be kept in mind by everybody who asks for more specialised forms of publication, since they will definitely change science, for better or worse.