EGU 2019: And here comes the rain

The third day of the EGU this year saw a change in the weather. Instead of the well-loved sunny Viennese spring days, the coming days will show us the city's grey and rainy side. It was also a day not so overwhelmingly filled with interesting talks, so I used it mostly for meeting people.

It started in the morning with a splinter meeting on sea-level databases. A small group had a lively discussion on shaping the future of databases in this field, and it was fun to see what might happen here in the coming years. After that I talked with people about future projects and visited a few talks on ice sheets in Antarctica and Greenland.

The rest of the day I mainly spent in the poster halls and, in between, watched a medal lecture on astronomical cycles. As no session in the last slot was of real interest to me (and if one had been, I guess I would not have fit into the room anyway), I called it an early end to the day. An EGU day without sun is quite different from one with it. Usually everybody leaves the centre at lunch to have a picnic in the park, but with all the rainy intervals, everybody looked for a place inside the conference centre. If this repeats over the remaining days, the EGU this year will have a completely different character from the last ones I visited. Let's hope for the best. Tomorrow I will give my talk, and the day will certainly be dominated by seasonal prediction.


Database paper background: What can the future bring?

Now that the new paper is out, the question remains what the future of palaeo sea-level and ice-sheet databases will look like. As this can be a wide-ranging topic, I would like to start with what I think the next developments will be, before turning to possible long-term aspirations.

In the current situation, the best we can hope for is that new site-specific databases get better and better. The steps may be small, but new papers making use of databases will demonstrate that a better understanding of the data is necessary. To understand the data better, we need more detailed databases, which consequently include uncertainty estimates and the limits of the data generation. One such paper, written by me, is already accepted and will be out in the coming days. Further ones by me and others will follow and demonstrate the need for these well-prepared databases.

From this starting point we can think about the future of general databases. Is it possible to create one access point for all available palaeo sea-level/ice-sheet data for a given time period? My answer is yes. It is possible, but many obstacles are still in the way. The most important one is long-term financing. To fulfil the ATTAC^3 guidelines, it is essential to guarantee that the data will be available in the long term. The next step would be setting up a trusted group of experts with the background to scientifically back the decisions required in the creation of large databases. Only once these two cornerstones exist would it be reasonable to start the technical development.

Nevertheless, many critical questions will come up. Is everybody allowed to contribute, or is only data published in journals suitable for such a database? What will be the data formats for exchange? How will the technical implementation guarantee the future suitability of such a database? And how can trust be built up? There are many problems in creating such a database, and so I do not think we will be there within the next five years. The funding problem in particular is severe enough to suppress possible advances in the field even longer. Until then, combining many different data sources will remain an issue, and will hopefully not lead to too many wrong scientific results caused by poor interpretation of the raw data.

Database paper background: Data citation and the problem with the H-index

The new paper covers many different topics around data publishing. One of these is how to cite data properly. A previous post has shown that there are some aspects peculiar to data publishing. How databases should be cited is still a problem, and in this post I try to highlight why.

Data citation was developed over the past years on the basis of the unique identifier, the DOI. It allows one to access a dataset in a fixed form, identical to the one published by the data authors. It is more or less guaranteed that metadata is available, so a dataset can be cited directly, for example in journal papers, instead of citing a journal paper that has used it. So far the developments are quite positive, and it seems possible that one day data publications will reach the same acceptance as journal papers.

The basis of this is creating credit for the original data authors. They invest more time in the metadata and documentation of the dataset, and in return they get something measurable, which helps them gain acceptance in the scientific community. Basically, this is how it works with journal articles and their citations, and this is how it might ideally work for datasets. I have shown in the past that there are ways to make this happen, by including peer review in the data publication process. Nevertheless, databases pose completely different problems and thereby put this whole framework at risk.

To understand that, I have to explain the most commonly used metric for citations, and with it the basis for giving credit to another scientist. The currency in science is citations of other researchers' research output. To make the quality of a researcher measurable with just one number, many different approaches have been introduced, but the most common one is the H-index. It works like this: the H-index is the largest number h such that the researcher has h articles with at least h citations each. So when you have five articles with at least five citations, but not six articles with at least six citations, you have an H-index of five. The aim is to push researchers to produce not just a few articles with high recognition, but many articles with at least some recognition.
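The definition above is easy to turn into code. This is a minimal illustrative sketch (the function name and the example citation counts are my own, not from any real dataset):

```python
def h_index(citations):
    """Return the H-index: the largest h such that the author has
    at least h works with at least h citations each."""
    h = 0
    # Rank works from most to least cited; the H-index is the last
    # rank at which the citation count still reaches the rank.
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Five works with at least five citations, but no sixth with six:
print(h_index([10, 8, 6, 5, 5, 2]))  # → 5
```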

So it is really about getting as many citations for as many different research outputs as possible. The criticism of this often points out that it does not account for your place in the author list, or that it is unfairly biased towards a broad set of at least acceptable articles instead of a few exceptional ones. So a lot depends on its interpretation.

Now we come to databases, which ideally collect data from many different data publications. Of course, we assume that the database cites these original publications and so fulfils best practice. Each of the originals then gets one citation, and that is fine. The problem comes into play when new publications appear that are based on the data from the database. They will probably cite the database as their source, so its authors get the credit they deserve. Unfortunately, the original data publications do not get cited in this case, because they are no longer the direct source. So anyone who did the work in the field has to fear that, with the introduction of a database covering their work, they lose all the future credit they deserve for it.

How can we expect that, under these conditions, scientists will be happy to provide data when they no longer get anything for it afterwards? This system would be unfair, and it basically comes down to the flaws in the H-index. If it counted not just direct citations but also indirect “second-level” citations, the problem would be manageable. I certainly have ideas on how to calculate better indices; unfortunately, I do not have the data for a case study. But unless we overcome this issue, database creation poses a risk to the currently established way of paying scientists with the recognised credit for their good work. I am certainly not the first to raise this issue, but I have to stress it again and again: as such, the H-index is not an appropriate measure to apply to research datasets.
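The post does not spell out a formula for such a second-level index, so the following is purely one possible sketch of the idea: a dataset earns full credit for direct citations and down-weighted credit for works that cite an intermediary (such as a database) which in turn cites the dataset. The function name, the citation-graph representation, and the 0.5 weight are all my own assumptions for illustration:

```python
def credited_citations(paper, cites, weight=0.5):
    """Count credit for `paper` given a citation graph.

    `cites` maps each work to the set of works it references.
    Direct citations count fully; second-level citations (works
    citing something that cites `paper`) count with `weight`.
    """
    # Works that reference the paper directly, e.g. a database.
    direct = {w for w, refs in cites.items() if paper in refs}
    # Works that reference one of those intermediaries instead.
    indirect = {w for w, refs in cites.items()
                if w != paper and w not in direct
                and any(mid in refs for mid in direct)}
    return len(direct) + weight * len(indirect)

# A database "DB" cites dataset "A"; two papers cite only the database:
graph = {"DB": {"A"}, "P1": {"DB"}, "P2": {"DB"}}
print(credited_citations("A", graph))  # → 2.0 (one direct + two half-weighted)
```

Under the plain citation count, dataset "A" would receive a single citation here; under this sketch, the two database users still contribute to its credit.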

Database paper background: Data publishing? Isn’t that a solved issue?

The new paper shows that data publishing, as it has developed over the past years, has achieved a lot and built a good basis. But it also shows that there are issues not yet resolved, especially when it comes to databases. In this post I would like to elaborate on this a bit and show how it can be addressed in the future.

Database paper background: What makes palaeo sea-level and ice-sheet data so special?

The new paper is about palaeo data, more specifically sea-level and ice-sheet data. While much of the guidance given in the paper is applicable to many different types of datasets, these data require special care. In this post I will talk about this and explain why the effort of good data handling is so essential in this field.