After discussing the place of data publications among the other scientific publication types, I want to give an overview of what a data publication might look like in the future. As I have stated before, to gain trust, a data peer review should be comparable to the peer review of other publication forms. The simplest way to achieve this is to build it up as similarly as possible, while including the changes that are necessary due to the nature of the publication entity.
So let us take a look at the classical form of peer review. Originally, the author delivers a manuscript to a journal of his or her choice and the editor decides how to proceed. The editor assigns one or more reviewers and asks them for a review. The reviewers then examine the manuscript, evaluate it, think about suggestions, and in the end file a report. This report usually includes a statement on whether the manuscript should be accepted and whether revisions are necessary. These reports are then evaluated by the editor, who decides on the future of the publication on the basis of the reviewer reports: accept the manuscript, ask for revisions (potentially involving the former reviewers again, or new ones), or reject it. Once accepted, the manuscript becomes a paper and, after publication, is regarded as part of the scientific record.
There are of course many variations of this scheme, since every publisher is free to decide how to implement it. There are also many procedures around that are similar to a peer review but are not regarded as such. Open review is one of the most popular: the manuscript is published openly (e.g. on the web) and everybody is able to comment on it. That this is not regarded as peer review does not mean the system is inappropriate, but technically it does not confer the same scientific acknowledgement as the system described above. In any case, it should not be confused with open peer review, where only the reviewer reports are made public.
If we want data peer review to look similar to the traditional peer review scheme, it has to be noted that much of the procedure described above can be reused. First of all, there is a data author, who has collected the publication entity, here the dataset with its metadata. In a second step s/he is able to send it to a publisher, which is currently not a journal but an accepted repository such as a data centre. There the editorial tasks are performed by the repository staff, who decide, as in an editorial review, what will be published or at least receive a badge of good quality.
Building on this, it would be feasible to introduce a peer review (as stated in the current paper; Lawrence et al. (2011) also made some moves in this direction). First of all, the author should make a statement on the quality of his or her dataset. The author should know it best and should document this knowledge as far as possible. Such a step was already included in the system we described in Quadt et al. (2012). As a consequence, the dataset would gain the status “approved by author”. The next level is then provided by the data centre staff. They are well qualified to make statements on the technical quality and especially the quality of the metadata. In some cases they are also able to look at the content of the datasets, when these fall into their field of expertise. By doing this, the dataset would be assigned the status “checked by qualified staff”. This is the status most datasets currently have in the data centres, and it is in general sufficient for using them. If datasets are really to receive the status “peer reviewed”, independent reviewers are necessary. Bringing them in, letting them make statements and evaluate the statements made by the other parties in this system, would lead to such a status change.
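The status progression described above can be thought of as a small state machine: each promotion requires a quality statement from the responsible party and must pass through every level in order. The following Python sketch is purely illustrative, assuming hypothetical class and status names, not taken from any existing repository software:

```python
from enum import Enum

class ReviewStatus(Enum):
    SUBMITTED = 0
    APPROVED_BY_AUTHOR = 1
    CHECKED_BY_QUALIFIED_STAFF = 2
    PEER_REVIEWED = 3

class DatasetRecord:
    """Hypothetical record tracking a dataset's review status.
    The three levels follow the progression described in the text;
    the names and structure are illustrative assumptions."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.status = ReviewStatus.SUBMITTED
        self.statements = []  # quality statements collected along the way

    def advance(self, new_status, statement):
        # Each promotion must move exactly one level up --
        # no skipping the data centre staff check.
        if new_status.value != self.status.value + 1:
            raise ValueError(
                f"cannot jump from {self.status.name} to {new_status.name}")
        self.statements.append((new_status.name, statement))
        self.status = new_status

record = DatasetRecord("example-dataset")
record.advance(ReviewStatus.APPROVED_BY_AUTHOR,
               "Author: instrument calibrated, gaps documented")
record.advance(ReviewStatus.CHECKED_BY_QUALIFIED_STAFF,
               "Data centre: metadata complete, format valid")
record.advance(ReviewStatus.PEER_REVIEWED,
               "Reviewer: methods sound, documentation sufficient")
```

The point of the one-level-at-a-time rule is that each status builds on the statements collected at the previous level, exactly as in the layered system sketched above.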
An important point here is that the core entity itself, the data, should not be changed by revisions. Raw data is essential in our modern world of science, and any change to it (such as a quality control that alters data values) is a loss of information. Changes may be applicable to the metadata, and flagging of data points may be possible, but the data values themselves have to stay untouched. It is all about making people aware of the problems within a dataset through its documentation: no more, but definitely also no less. We need proper datasets that are well documented and come as directly from the instruments or the calculations as possible. This is true not only for the raw data (level 0) but also for the primary data, something like level 1 or 2 in the data classification used by the satellite community. Only the pure, unmodified data helps to bring science forward. Any changes to it are critical and have to be very well justified and documented. Derived datasets have their own value and can be worth publishing in their own right. Nevertheless, such data has to be reproducible from the source data, which is only possible when that data is available in its original form.
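This separation of concerns can also be expressed in code: the data values are frozen at publication time, while metadata can still be revised and individual points can be flagged. The sketch below is a minimal illustration under assumed names, not the interface of any real data centre:

```python
class PublishedDataset:
    """Illustrative sketch: data values are immutable after publication,
    while metadata may be corrected and points flagged. The class and
    method names are hypothetical assumptions."""

    def __init__(self, values, metadata):
        self._values = tuple(values)     # frozen copy of the data values
        self.metadata = dict(metadata)   # metadata may still be revised
        self.flags = {}                  # index -> quality flag; data untouched

    @property
    def values(self):
        return self._values  # read-only view; no setter is provided

    def flag_point(self, index, flag):
        # Mark a suspicious value instead of altering or deleting it.
        self.flags[index] = flag

ds = PublishedDataset([1.2, 3.4, -999.0], {"instrument": "thermometer A"})
ds.flag_point(2, "suspect: fill value")
ds.metadata["comment"] = "metadata revision 2"
# ds.values[2] = 0.0 would raise a TypeError: tuples reject assignment
```

Storing the values in a tuple makes the immutability structural rather than a mere convention: a later quality control step can only annotate, never overwrite.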
As I have shown here, a general system can be established; the remaining problem is how to do this effectively. A tool for this could be quality evaluation: a technical instrument that gains information about the quality of a dataset by applying quality checks. How this works and what it is can be found in the next post.