The new paper is about palaeo data, more specifically sea-level and ice-sheet data. While many of the guidelines given in the paper apply to a wide range of datasets, these particular data require special care. In this post I will explain why, and why the effort of good data handling is so essential in this field.
Palaeo sea-level and ice-sheet data are mainly geological datasets, which are unique and expensive to recover. Ice-sheet data are often based on rock analyses, while sea-level index points can be generated by many different procedures. Some are based on corals, where corals found at a certain position are compared with the living requirements of their present-day counterparts. Since the position (in three-dimensional space) is the information that ultimately gives the sea-level height of such a measurement point, it is essential to verify that the coral has not been moved over time. Other index points are based on drill holes and can yield several points in succession, and the analysis of archaeological features can also provide new sea-level index points. The process of generating a single data point thus differs quite a bit between the different types and therefore has to be well documented.
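To make this concrete, here is a minimal sketch of how such a heterogeneous data point might be represented. The field names and indicator categories are purely illustrative assumptions of mine, not a format taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class IndicatorType(Enum):
    """Illustrative indicator categories mentioned above."""
    CORAL = "coral"
    DRILL_HOLE = "drill hole"
    ARCHAEOLOGICAL = "archaeological feature"


@dataclass
class SeaLevelIndexPoint:
    """Hypothetical record for a single sea-level index point."""
    indicator_type: IndicatorType
    latitude: float     # degrees north
    longitude: float    # degrees east
    elevation_m: float  # metres relative to a stated datum
    displaced: bool     # has the sample been moved since deposition?
    notes: str = ""     # free-text description of the recovery context


# Example: a coral-based point judged to still be in its original position.
point = SeaLevelIndexPoint(
    indicator_type=IndicatorType.CORAL,
    latitude=-17.5,
    longitude=149.8,
    elevation_m=-4.2,
    displaced=False,
    notes="Compared against the depth range of living counterparts of the same species.",
)
print(point)
```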
This documentation focuses on two key pieces of information: the age and the position. Both are very important for further analysis and therefore have to be treated with the same care. Re-users of the datasets are not necessarily fully familiar with the methodologies used to generate this information. The procedure and assumptions used to get from the raw data to the final data point therefore have to be documented within the databases in standardised and understandable formats. The data used in this processing, such as tectonic corrections, also has to be easily accessible and transparently documented.
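One way to keep the path from raw measurement to final data point transparent is to store each processing step, such as a tectonic correction, as its own record alongside the point. The structure below is only a sketch under that assumption; the field names and the example numbers are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    """One documented step from raw data to the published value."""
    name: str               # e.g. "tectonic uplift correction"
    description: str        # what was done and why
    reference: str          # where the correction values come from
    applied_shift_m: float  # net effect on the elevation, in metres


@dataclass
class DocumentedIndexPoint:
    """Hypothetical record coupling the final values with their provenance."""
    age_ka: float           # age in thousands of years
    elevation_m: float      # final, corrected elevation
    raw_elevation_m: float  # elevation as originally measured
    processing: List[ProcessingStep] = field(default_factory=list)


point = DocumentedIndexPoint(
    age_ka=124.0,
    elevation_m=3.1,
    raw_elevation_m=5.6,
    processing=[
        ProcessingStep(
            name="tectonic uplift correction",
            description="Removed long-term uplift assuming a constant rate.",
            reference="regional uplift model (placeholder reference)",
            applied_shift_m=-2.5,
        )
    ],
)

# A re-user can reconstruct the raw value from the documented steps.
assert abs(point.elevation_m - sum(s.applied_shift_m for s in point.processing)
           - point.raw_elevation_m) < 1e-9
```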
These two pieces of information both require an estimate of their uncertainty. Since in further analyses the uncertainty can matter much more than the actual value, any user of the database has to be able to understand exactly how this uncertainty range was generated. Many implicit assumptions, such as Gaussian uncertainties, are common and often valid, while other components of the uncertainty estimate might be non-Gaussian. Because a data generator or database creator who publishes their data never knows who might use it one day, every additional piece of information stored in a standardised way can be of huge importance to others. Storing this information is therefore a challenge, and communicating what the stored information implies even more so.
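Because a bare "±" value hides whether the underlying distribution is Gaussian or not, one option is to store the distribution family together with its parameters, so a re-user can reproduce and propagate the uncertainty exactly. The sketch below assumes such a convention; the names and the two families shown are only illustrative.

```python
import random
from dataclasses import dataclass


@dataclass
class Uncertainty:
    """Hypothetical standardised description of an uncertainty."""
    family: str       # "normal", "uniform", ... (illustrative set)
    parameters: dict  # family-specific parameters, in stated units

    def sample(self) -> float:
        # Draw one realisation so re-users can propagate the uncertainty
        # instead of silently assuming everything is Gaussian.
        if self.family == "normal":
            return random.gauss(self.parameters["mean"], self.parameters["sd"])
        if self.family == "uniform":
            return random.uniform(self.parameters["low"], self.parameters["high"])
        raise ValueError(f"Unknown uncertainty family: {self.family}")


# A Gaussian elevation uncertainty and a non-Gaussian (uniform) age uncertainty.
elevation_unc = Uncertainty("normal", {"mean": 3.1, "sd": 0.4})
age_unc = Uncertainty("uniform", {"low": 122.0, "high": 126.0})

print(elevation_unc.sample(), age_unc.sample())
```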
This complex setting, combined with the interdisciplinarity of the field, makes the design of databases a big challenge. Nevertheless, a full understanding of this information is required if these databases are later to be used to design an analysis of global sea level over time. The effort is therefore worthwhile, and the paper offers some starting points for how databases can be set up in the future to become even more successful in this task. A key point is standardisation, as only then is it possible to build databases that are genuinely usable by later users. At the same time, every data point is unique, and the data generators might want to bring in as many of them as possible. This too needs to happen in a standardised way, so that the data does not merely fill up storage space but can actually be used.
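Standardisation only pays off if incoming points are actually checked against the agreed structure before they enter the database. Below is a minimal sketch of such a check; the set of required fields is my own assumption for illustration, not a recommendation from the paper.

```python
# Minimal validation of an incoming record against an agreed set of fields.
REQUIRED_FIELDS = {
    "age_ka": float,
    "age_uncertainty": dict,
    "elevation_m": float,
    "elevation_uncertainty": dict,
    "indicator_type": str,
    "processing": list,
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record can be ingested."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"field {name} should be of type {expected_type.__name__}")
    return problems


candidate = {
    "age_ka": 124.0,
    "age_uncertainty": {"family": "uniform", "low": 122.0, "high": 126.0},
    "elevation_m": 3.1,
    "indicator_type": "coral",
    "processing": [],
}
print(validate(candidate))  # -> ['missing field: elevation_uncertainty']
```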
All in all, the variety of these data is a huge challenge for future databases, while the complex generation of each single data point makes it worth the effort to bring every properly measured point into them.