Database paper background: What can the future bring?

Now that the new paper is out, the question remains what the future of palaeo sea-level and ice-sheet databases will look like. As this is a wide-ranging topic, I would like to start with what I think the next developments will be, before turning to possible long-term aspirations.

In the current situation, the best we can hope for is that new site-specific databases keep getting better. The steps may be small, but new papers making use of databases will demonstrate that a better understanding of the data is necessary. To understand the data better, we need more detailed databases, which consequently include uncertainty estimates and document the limits of the data generation. One such paper, written by me, has already been accepted and will be out in the next days. Further ones by me and others will follow and demonstrate the need for these well-prepared databases.

Building on this, we can think about the future of general databases. Is it possible to create a single access point for all available palaeo sea-level/ice-sheet data for a given time period? My answer is yes. It is possible, but many obstacles remain. The most important is long-term financing: to fulfil the ATTAC^3 guidelines, it must be guaranteed that the data will remain available in the long term. The next step would be to set up a trusted group of experts with the background to scientifically back the decisions required in the creation of large databases. Only once these two cornerstones exist would it be reasonable to start the technical development.

Nevertheless, many critical points will come up. Is everybody allowed to contribute, or is only data published in journals suitable for such a database? What will be the data formats for exchange? How will the technical implementation guarantee the future suitability of such a database? And how can trust be built up? There are many problems in creating such a database, so I do not think we will get there in the next five years. The funding problem in particular is severe and will suppress progress in the field even longer. Until then, combining many different data sources will remain an issue and will hopefully not lead to too many wrong scientific results caused by misinterpretation of raw data.


Database paper background: Data citation and the problem with the H-index

The new paper covers many different topics around data publishing. One of them is how to cite data properly. A previous post showed that some aspects are specific to data publishing. How databases can be cited is still a problem, and in this post I try to highlight the issues.

Data citation has developed in recent years on the basis of the unique identifier DOI. A DOI allows one to reach a dataset in a fixed form, identical to the one published by the data authors. It more or less guarantees that metadata is available, so that, for example, a journal paper can cite the dataset directly instead of citing another journal paper that used it. So far the developments are quite positive, and it seems possible that one day data publications will reach the same acceptance as journal papers.

The basis of this is the creation of credit for the original data authors. They invest more time in the metadata and the documentation of the dataset, and in return they get something measurable that helps them gain acceptance in the scientific community. This is essentially how it works for journal articles and their citations, and how it might ideally work for datasets. I have shown in the past that there are ways to make this happen, by including peer review in the data publication process. Nevertheless, databases pose completely different problems and thereby put this framework at risk.

To understand why, I have to explain the most commonly used citation metric, and with it the basis for giving credit to another scientist. The currency of science is citations of other researchers' research output. To make the quality of a researcher measurable with a single number, many different approaches have been introduced, but the most common is the H-index. It works like this: a researcher's H-index is the largest number h such that they have at least h articles with at least h citations each. So when you have five articles with at least five citations each, but not six articles with at least six citations each, your H-index is five. The aim is to push researchers to produce not just a few articles with high recognition, but many articles with at least some recognition.
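The definition above is easy to state precisely in a few lines of code. The following is a minimal sketch (the function name and the example citation counts are my own illustration, not taken from any real citation database):

```python
def h_index(citations):
    """Return the H-index: the largest h such that the researcher
    has at least h papers with at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break
    return h

# The example from the text: five articles with at least five
# citations each, but no sixth article with six citations.
print(h_index([9, 7, 6, 5, 5, 2]))  # prints 5
```

Sorting the citation counts in descending order makes the check simple: walk down the list and keep the last rank at which the paper's citations still match or exceed its rank.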

So it is really about getting as many citations for as many different research outputs as possible. The criticism is often that it does not account for your position in the author list, or that it is unfairly biased towards a broad body of merely acceptable articles instead of a few exceptional ones. So much depends on its interpretation.

Now we come to databases, which ideally collect data from many different data publications. Of course, we assume that the database cites these original publications and so fulfils best practice. Each of the originals then gets one citation, and that is fine. The problem arises when new publications come out that are based on the data from the database. Such a publication will probably cite the database as its source, so the database authors get the credit they deserve. Unfortunately, the original data publications are not cited in this case, because they are no longer the direct source. So anyone who did the work in the field has to fear that, with the introduction of a database covering their work, they lose all future deserved credit for it.

How can we expect scientists to be happy to provide data under these conditions, when they no longer get anything for it afterwards? This system would be unfair, and it basically comes down to the flaws of the H-index. If it counted not just direct citations but also indirect "second level" citations, the problem would be manageable. I certainly have ideas for how to calculate better indices; unfortunately, I do not have the data for a case study. But unless we overcome this issue, database creation poses a risk to the currently established way of paying scientists with recognised credit for their good work. I am certainly not the first to raise this issue, but I have to stress it again and again: as it stands, the H-index is not an appropriate measure to apply to research datasets.

Database paper background: Data publishing? Isn’t that a solved issue?

The new paper shows that data publishing, as it has developed in recent years, has achieved a lot and built a good basis. But it also shows that there are issues not yet resolved, especially when it comes to databases. In this post I would like to elaborate on this a bit and show how it can be addressed in the future. Continue reading

Database paper background: What makes palaeo sea-level and ice-sheet data so special?

The new paper is about palaeo data, and more specifically about sea-level and ice-sheet data. While many of the guidelines given in the paper are applicable to many different types of datasets, these data require special care. In this post I will talk about this and explain why the effort of good data handling is so essential in this field. Continue reading

A stronger publication regime

Last week several journals published an agreement made at a National Institutes of Health (NIH) workshop in June 2014. It focuses on preclinical trials, but allows a wider view on the development of the publication of research in general. Furthermore, large journals like Science and Nature have accompanied this with further remarks on their view of the future of proper documentation of scientific research, which head in the direction I named "Open methods, open data, open models" a while ago. In this post I would like to comment on the agreement and some reactions from these major journals.

Continue reading

Data peer review paper background: Chances for the future

The environment for the publication of data is currently changing rapidly. New data journals are emerging, like Scientific Data from Nature two weeks ago, or the Geoscience Data Journal by Wiley. The latter was also the focus of the PREPARDE project, which delivered a nice paper on data peer review a couple of weeks ago (Mayernik et al., 2014). Furthermore, more and more funding agencies require the publication of data, and it is to be expected that this demand will put more pressure on scientists to make their work publicly available.

These developments are great, but at this point I would like to think further into the future. Where should we be in five or ten years, and what is possible in, let's say, 30 or more years? A lot, is the answer, but let's go into a little more detail. Continue reading

Data peer review paper background: The philosophical problem of a missing data peer review

In philosophy, several great minds have addressed the way scientists should work to gain knowledge. Among others, Bacon (1620) and Popper (1934) showed different ways to gain information and how it can be evaluated to become science. During my PhD I developed a relatively simple and general working scheme for scientists, which was published in Quadt et al. (2012). The paper analysed how this general scientific working scheme could be represented by scientific publications.

Figure: The way scientists should work (Quadt et al., 2012)

While the traditional journal paper, which has existed since the Philosophical Transactions of the Royal Society, first edited by Henry Oldenburg in 1665, covers the whole scientific process, new forms have emerged in the last decade. Data papers (Pfeiffenberger & Carlson, 2011), short journal articles focusing on the experimental design and presenting the data from an experiment, filled a gap and should simplify the use of data. Another approach is the publication of data and metadata at a data centre itself, without an accompanying journal article.

This type of publication was part of my project at that time. A general question was how such a publication can be made comparable to the other types. The comparison showed that it is quite comparable, but that one important element is missing: peer review. Continue reading