The new paper covers many different topics in data publishing. One of them is how to cite data properly. A previous post showed that some aspects of citation are specific to data publishing. How databases should be cited is still an open problem, and in this post I try to highlight the issues involved.
Data citation has been developed over the past years on the basis of the DOI, a unique identifier. It gives access to a dataset in a fixed form, identical to the one published by the data authors. It is more or less guaranteed that metadata is available, so the dataset can be cited directly, for example in journal papers, instead of citing a journal paper that has used it. So far the developments are quite positive, and it seems possible that one day data publications will reach the same acceptance as journal papers.
The basis of this is the creation of credit for the original data authors. They invest more time in the metadata and the documentation of the dataset, and in return they get something measurable that helps them gain acceptance in the scientific community. Basically, this is how it works with journal articles and their citations, and this is how it might ideally work for datasets. I have shown in the past that there are ways to make this happen by including peer review in the data publication process. Nevertheless, databases pose completely different problems and so put the functioning of this framework at risk.
To understand this, I have to explain the most commonly used citation metric, and with it the basis for giving credit to another scientist. The currency of science is the citation of other researchers' research output. To make the quality of a researcher measurable with a single number, many different approaches have been introduced, but the most common one is the H-index. It basically works like this: the H-index is the largest number h such that the researcher has h articles with at least h citations each. So if you have five articles with at least five citations each, but not six articles with at least six citations each, you have an H-index of five. The aim is to bring researchers to a point where they produce not just a few articles with high recognition, but many articles with at least some recognition.
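The definition above can be written down in a few lines. This is only a minimal sketch: the function name h_index and the example citation counts are made up for illustration and do not come from any real researcher.

```python
def h_index(citation_counts):
    """Return the largest h such that h items have at least h citations each."""
    # Rank the items from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The first `rank` items all have at least `rank` citations.
            h = rank
        else:
            break
    return h


# Hypothetical researcher with seven articles (invented numbers).
example_counts = [12, 9, 7, 5, 5, 3, 1]
print(h_index(example_counts))  # prints 5
```

Five articles have at least five citations each, but the sixth has only three, so the result is five. Note that the article with twelve citations contributes no more than the ones with five.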
So it is really about getting as many citations for as many different research outputs as possible. Criticism of this often points out that it does not account for your position in the author list, or that it is unfairly biased towards a large number of at least acceptable articles over a few exceptional ones. So a lot depends on how it is interpreted.
Now we come to databases, which ideally collect data from many different data publications. Of course, we assume that the database cites these original publications and so follows best practice. Each of the originals might then get one citation, and that is fine. The problem arises when new publications come out that are based on the data from the database. Such a publication would probably cite the database as its source, and so the authors of the database get the credit they deserve. Unfortunately, the original data publications are not cited in this case, because they are no longer the direct source. So anyone who did the work in the field has to fear that, once a database covers their work, they will lose all the future credit they deserve for it.
How can we assume that under these conditions scientists will be happy to provide data, when they get nothing for it afterwards? This system would be unfair, and it basically comes down to the flaws in the H-index. If it counted not just the direct citations but also indirect “second level” citations, the problem would be manageable. I certainly have ideas for how to calculate better indices; unfortunately I do not have the data for a case study. But unless we overcome this issue, database creation poses a risk to the currently established way of paying scientists with recognised credit for their good work. I am certainly not the first to raise this issue, but I have to stress it again and again: as such, the H-index is not an appropriate measure to apply to research datasets.
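To make the idea of “second level” citations a bit more concrete, here is a small sketch. It is not one of the better indices mentioned above, just one possible way of counting, and everything in it is an assumption: the function name citation_counts, the pass_through parameter and the tiny citation graph are invented for illustration. The assumed rule is that a citation of a database is also forwarded once to every work the database itself cites.

```python
from collections import Counter


def citation_counts(cites, pass_through=frozenset()):
    """Count citations per work.

    cites maps each citing work to the list of works it cites directly.
    Works in pass_through (assumed here to be databases) forward every
    citation they receive to the works they cite themselves.
    """
    counts = Counter()
    for citing, cited_works in cites.items():
        for cited in cited_works:
            counts[cited] += 1  # direct citation
            if cited in pass_through:
                for source in cites.get(cited, []):
                    # Skip sources the citing work already cites directly,
                    # so no work is credited twice by the same paper.
                    if source not in cited_works:
                        counts[source] += 1  # indirect, second-level citation
    return counts


# Invented example: one database built on two datasets, three later papers.
cites = {
    "database": ["dataset_a", "dataset_b"],
    "paper_1": ["database"],
    "paper_2": ["database"],
    "paper_3": ["database", "dataset_a"],
}

direct = citation_counts(cites)
second_level = citation_counts(cites, pass_through={"database"})

print(direct["dataset_b"], second_level["dataset_b"])  # prints 1 4
```

With direct counting only, dataset_b is stuck at the single citation it received from the database. With the forwarding rule, it also receives credit from the three papers that used it through the database.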