Database paper background: Data citation and the problem with the H-index

The new paper covers many topics in data publishing. One of them is how to cite data properly. An earlier post showed that data publishing has some characteristics of its own. How databases can be cited is still an open problem, and in this post I try to highlight the issues.

Data citation has developed over the past years on the basis of the unique identifier DOI. A DOI makes it possible to reach a dataset in a fixed form, identical to the one published by the data authors. It is more or less guaranteed that metadata is available, so the dataset can be cited directly, for example in a journal paper, instead of citing a journal paper that has used it. So far the developments are quite positive, and it seems possible that data publications will one day reach the same acceptance as journal papers.

The basis of this is the creation of credit for the original data authors. They invest more time in the metadata and the documentation of the dataset, and in return they get something measurable that helps them gain acceptance in the scientific community. Basically, this is how it works with journal articles and their citations, and this is how it might ideally work for datasets. I have shown in the past that there are ways to make this happen by including peer review in the data publication process. Nevertheless, databases pose completely different problems and so put the proper working of this framework at risk.

To understand that, I have to explain the most commonly used metric for citations, and with it the basis for giving credit to another scientist. The currency in science is citations of other researchers' research output. To make the quality of a researcher measurable with just one number, many different approaches have been introduced, but the most common one is the H-index. It basically works like this: the H-index is the largest number h such that the researcher has h articles with at least h citations each. So when you have five articles with at least five citations each, but not six articles with at least six citations each, you have an H-index of five. The aim is to bring researchers to a point where they produce not just a few articles with high recognition, but many articles with at least some recognition.
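As a minimal sketch of the definition above, the H-index can be computed from a list of per-article citation counts. The function name `h_index` and the example numbers are made up for illustration and do not come from any real citation record.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    # Rank articles from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The first `rank` articles all have at least `rank` citations.
            h = rank
        else:
            # Every later article has even fewer citations, so stop here.
            break
    return h


# Hypothetical researcher: five articles with at least five citations,
# but not six articles with at least six citations.
example = [9, 7, 6, 5, 5, 2]
print(h_index(example))
```

Note that a single article with a hundred citations and nothing else still gives an H-index of one, which is exactly the bias towards breadth described in the text.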

So it is really about getting as many citations for as many different research outputs as possible. Common criticisms are that it does not take into account your position in the author list, and that it is unfairly biased towards a large number of at least acceptable articles instead of a few exceptional ones. So a lot depends on its interpretation.

Now we come to databases, which ideally collect data from many different data publications. Of course, we assume that the database cites these original publications and so follows best practice. So each of the originals gets one citation, and that is fine. The problem comes into play when new publications appear that are based on the data from the database. They would probably cite the database as their source, and so its authors get the credit they deserve. Unfortunately, the original data publications do not get cited in this case, because they are no longer the direct source. So those who did the work in the field have to fear that, with the introduction of a database covering their work, they lose all the future credit they deserve for it.

How can we assume that scientists will be happy to provide data under these conditions, when they no longer get anything for it afterwards? This system would be unfair, and it basically comes down to the flaws in the H-index. If it counted not just direct citations but also indirect “second level” citations, the problem would be manageable. I certainly have ideas for how to calculate better indices; unfortunately, I do not have the data for a case study. But unless we overcome this issue, database creation poses a risk to the currently established way of paying scientists with recognised credit for their good work. I am certainly not the first to raise this issue, but I have to stress it again and again: as such, the H-index is not an appropriate measure to apply to research datasets.
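To illustrate what counting “second level” citations could mean, here is a small sketch on a toy citation graph. It is not one of the better indices mentioned above, which this post does not spell out. The function `citation_counts`, the graph layout and all names in it are hypothetical, and treating every citation that passes through any intermediate work as second level is a simplifying assumption.

```python
def citation_counts(cites, target):
    """Return (direct, second_level) citation counts for `target`.

    `cites` maps each work to the set of works it cites (an assumed,
    simplified representation of a citation graph).
    """
    # Works that cite the target directly.
    direct = {src for src, refs in cites.items() if target in refs}
    # Works that do not cite the target themselves, but cite a direct citer.
    second_level = {
        src
        for src, refs in cites.items()
        if src != target and src not in direct and refs & direct
    }
    return len(direct), len(second_level)


# Toy example: a database compiles two datasets, and later papers
# mostly cite the database instead of the original datasets.
cites = {
    "database": {"dataset_a", "dataset_b"},
    "paper_1": {"database"},
    "paper_2": {"database"},
    "paper_3": {"dataset_a"},
}

for work in ("dataset_a", "dataset_b", "database"):
    direct, second = citation_counts(cites, work)
    print(work, "direct:", direct, "second level:", second)
```

In this toy graph, `dataset_b` receives a single direct citation (from the database), while two further papers reach it only through the database. A metric based on direct citations alone never sees those two.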


Database paper background: Data publishing? Isn’t that a solved issue?

The new paper shows that data publishing, as it has developed in the past years, has achieved a lot and built a good basis. But it also shows that there are issues not yet resolved, especially when it comes to databases. In this post I would like to elaborate on this a bit and show how these issues can be resolved in the future.

Database paper background: What does ATTAC^3 mean for scientific data handling?

The new paper proposes an ATTAC^3 scheme as guidance for data handling, which will also be of interest for other fields. To be open, collecting these points in this way was not my idea; it emerged during the writing process from one of the co-authors. I think it is a good basis for handling research data in general, so in the following I will explain it a bit from my personal view.


Database paper background: What makes palaeo sea-level and ice-sheet data so special?

The new paper is about palaeo data, and more specifically about sea-level and ice-sheet data. While much of the guidance given in this paper applies to many different types of datasets, these data require special care. In this post I will talk about this and explain why the effort of good data handling is so essential in this field.

Background to “Palaeo sea-level and ice-sheet databases: problems, strategies and perspectives”

This post is about the new paper on databases, which came out this week and was written by a group of authors, including me. The paper is titled “Palaeo sea-level and ice-sheet databases: problems, strategies and perspectives” and was published in Climate of the Past. It is based on discussions and results of a PALSEA2 meeting in Lochinver in 2014 and was compiled by many leading scientists in this field.

The fundamental question this paper addresses is how the information gathered in the field can be brought to those who use the data in their analyses. Databases are one medium for this, and many scientists create their own in many different ways. But making them reusable by other scientists in the best possible way is a huge task. The paper therefore gives some general guidelines for the workflow of database creation and for the steps that should be taken in the future. It focuses on palaeo sea-level and ice-sheet data, as they are used to investigate the development of sea level over the past thousands and millions of years.

This post is meant as an introduction to some further background posts, which I intend to write in the coming weeks on details of the paper. The topics I intend to write about are:

  1. What makes palaeo sea-level and ice-sheet data so special?
  2. What does ATTAC^3 mean for scientific data handling?
  3. Data publishing? Isn’t that a solved issue?
  4. Data citation and the problem with the H-index
  5. What can the future bring?

With these posts I hope to shed light on some more details of the paper from my personal view.