Database paper background: Data citation and the problem with the H-index

The new paper covers many different aspects of data publishing. One of them is how to cite data properly. A previous post showed that data publishing has some special characteristics. How databases can be cited is still an open problem, and in this post I try to highlight it.

Data citation has developed over the past years on the basis of the DOI, a persistent unique identifier. A DOI allows a dataset to be reached in a fixed form, identical to the one the data authors published. Metadata availability is more or less guaranteed, so the DOI can be used, for example, in journal papers to cite the dataset directly instead of citing a journal paper that used it. So far the developments are quite positive, and it seems possible that one day data publications will reach the same acceptance as journal papers.

The basis of this is creating credit for the original data authors. They invest extra time in the metadata and documentation of the dataset, and in return they get something measurable that helps them gain acceptance in the scientific community. This is basically how it works for journal articles and their citations, and how it might ideally work for datasets. I have shown in the past that there are ways to make this happen, for example by including peer review in the data publication process. Databases, however, pose completely different problems and so risk breaking this framework.

To understand why, I have to explain the most commonly used citation metric, which is the basis for giving credit to other scientists. The currency of science is citations of other researchers' research output. Many different approaches have been introduced to make the quality of a researcher measurable with a single number, but the most common one is the H-index. It works like this: a researcher has an H-index of h if h of their articles have at least h citations each, but they do not have h+1 articles with at least h+1 citations each. So when you have five articles with at least five citations each, but not six articles with at least six citations each, you have an H-index of five. The aim is to encourage researchers to produce not just a few articles with high recognition, but many articles with at least some recognition.
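The definition above is easy to turn into a few lines of code. The following is a minimal sketch of the standard H-index calculation (the function name and the example citation counts are my own, purely for illustration):

```python
def h_index(citations):
    """Return the H-index for a list of per-article citation counts."""
    # Sort descending: the H-index is the largest h such that the
    # h-th most cited article still has at least h citations.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four articles with at least 4 citations
print(h_index([5, 5, 5, 5, 5]))  # 5
```

Note how the first example gives 4, not 5: the fifth article has only 3 citations, so there are no five articles with at least five citations each.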

So it is really about getting citations for as many different research outputs as possible. The criticism of the H-index often points out that it does not account for your position in the author list, or that it is biased towards a large number of merely acceptable articles instead of a few exceptional ones. So a lot depends on its interpretation.

Now we come to databases, which ideally collect data from many different data publications. Of course, we assume that the database cites these original publications and so follows best practice. Each of the originals gets one citation, and that is fine. The problem comes into play when new publications appear that are based on the data from the database. Such a publication will probably cite the database as its source, so the database authors get the credit they deserve. Unfortunately, the original data publications are not cited in this case, because they are no longer the direct source. So anyone who did the work in the field has to fear that, with the introduction of a database covering their work, they lose all the future credit they deserve for it.

How can we expect scientists to be happy to provide data under these conditions, when they no longer get anything for it afterwards? This system would be unfair, and it basically comes down to the flaws of the H-index. If it counted not just direct citations, but also indirect “second-level” citations, the problem would be manageable. I certainly have ideas how to calculate better indices; unfortunately, I do not have the data for a case study. But unless we overcome this issue, database creation poses a risk to the currently established way of paying scientists with recognised credit for their good work. I am certainly not the first to raise this issue, but I have to stress it again and again: as it stands, the H-index is not an appropriate measure to apply to research datasets.
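To make the idea of second-level citations a bit more concrete, here is a hypothetical sketch. The citation graph, the names in it, and the damping weight of 0.5 for indirect citations are all my own illustrative assumptions, not an established metric or anything from the paper:

```python
# Hypothetical sketch: credit a dataset for indirect, "second-level"
# citations that reach it via a database built on top of it.
# The 0.5 damping weight for indirect citations is an arbitrary
# illustrative choice.

def second_level_citations(cited_by, work, weight=0.5):
    """Direct citations plus damped citations of the citing works."""
    direct = cited_by.get(work, [])
    indirect = sum(len(cited_by.get(c, [])) for c in direct)
    return len(direct) + weight * indirect

# "dataset_A" is cited only once, by "database_X"; the database itself
# is cited by three new papers that never cite the dataset directly.
cited_by = {
    "dataset_A": ["database_X"],
    "database_X": ["paper_1", "paper_2", "paper_3"],
}

print(second_level_citations(cited_by, "dataset_A"))  # 1 + 0.5 * 3 = 2.5
```

Under a plain direct-citation count, dataset_A would be stuck at one citation no matter how heavily the database is reused; a second-level scheme like this would let some of that downstream credit flow back to the original data authors.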

Data peer review paper background: Chances for the future

The environment for the publication of data is currently changing rapidly. New data journals are emerging, like Nature's Scientific Data two weeks ago or Wiley's Geoscience Data Journal. The latter was also the focus of the PREPARDE project, which delivered a nice paper on data peer review a couple of weeks ago (Mayernik et al., 2014). Furthermore, more and more funding agencies require the publication of data, and this demand can be expected to put more pressure on scientists to make their work publicly available.

These developments are great, but at this point I would like to think further into the future. Where should we be in five or ten years, and what is possible in, let's say, 30 or more years? A lot is the answer, but let's go into a little more detail.

Data peer review paper background: Statistical quality evaluation? What is that?

A basic point of the new paper is the introduction of quality evaluation. But what does this mean, and why do I think it is important? Well, for the first question I have to talk a little about the background. The common words we use in connection with quality are assurance and control. Depending on their definition, they focus on improving either the product or the processes that lead to the product. Since the products we are talking about are data, both aim to deliver better datasets.

Nevertheless, with peer review we are dealing with a different stage, since we are now in the phase in which we want to quantify the quality. To do this, some points have to be made clear. The first is that quality is subjective. Especially when we think about the peer review process, it is important to keep in mind that it is not an objective process. The quality of the published entity is defined by the opinions of the reviewers and the editor, and therefore inevitably has a personal touch. Of course, the same is true for data peer review.

Data peer review paper background: The philosophical problematic of a missing data peer review

In philosophy, several great minds have addressed the way scientists should work to gain knowledge. Among others, Bacon (1620) and Popper (1934) showed different ways to gain information and how it can be evaluated to become science. During my PhD I developed a relatively simple and general working scheme for scientists, which was published in Quadt et al. (2012). The paper analysed how this general scientific working scheme could be represented by scientific publications.

The way scientists should work (Quadt et al., 2012)

While the traditional journal paper, which has existed since the Philosophical Transactions of the Royal Society, edited by Henry Oldenburg in 1665, covers the whole scientific process, new forms have emerged in the last decade. Data papers (Pfeiffenberger & Carlson, 2011), short journal articles focussing on the experimental design and presenting the data from the experiment, filled a gap and should simplify the use of data. Another approach is the publication of data and metadata at a data centre itself, without an accompanying journal article.

This type of publication was part of my project at that time. A general question therein was how such a publication can be made comparable to the other types. The comparison showed that it is quite comparable, but that one important element is missing: peer review.

Background to “Automated quality evaluation for a more effective data peer review”

The paper “Automated quality evaluation for a more effective data peer review”, which my co-author and I published in the Data Science Journal this week, started as a common background theme for my PhD thesis. The task was to find a way to bring the loose chapters on quality tests together.

The basic idea was to take a closer look at the publication process in general and, since it was the topic of the project at that time, at how it can be applied to data. This approach led to a lot of questions, especially about how scientists work, how they interact through their publications, and how they should work. The latter is quite philosophical and was partly addressed in Quadt et al. (2012).

In the upcoming week I want to give some insights into the general topic of the paper and how it tries to address the problems that have arisen. The topics are:

  1. The philosophical problematic of a missing data peer review
  2. What a data peer review could look like
  3. Statistical quality evaluation? What is that?
  4. Why quality is a decisive information for data
  5. Chances for the future

I hope these topics will show a little of what is behind this paper and how it fits into the scientific landscape.

To fully understand the paper, it has to be seen in connection with Quadt et al. (2012). In that paper we showed that traditional publications and data publications can be published in a comparable way, but that one major element is missing for this: data peer review.


EGU 2014 – The last day

The last day of EGU 2014 is done, and it was again a quite interesting one. It started with a session on data publication, which gave a good overview of the current technical developments within this community. Since I had written my PhD on this topic, it was definitely a must-see session for me. Additionally, my poster was placed in this session and presented in the following slot. There I had some very interesting discussions about the necessity and potential consequences of data peer review.

After lunch I paid a short visit to a nice verification talk before walking over to the sea-level session. Several interesting talks there, especially those focussing on statistics, made a nice ending to a week of talks. The poster session at the end again offered some interesting points of discussion, and with it the conference ended.

All in all, it was a great week in Vienna. As I had hoped, a lot of interesting discussions emerged; I saw many interesting talks and posters and learned a lot. I am happy with the responses to my contributions, and the wonderful weather was a great bonus before I travel back to the UK. This was a week with many ups and just a few downs, so I hope I will have the chance to be back in Vienna soon.