Massive ensemble paper background: What will the future bring?

In my final post on the background of the recently published paper, I would like to take a look into the future of this kind of research. It largely brings together points I have already written about on different occasions, but putting them into one post might make them clearer.

Palaeo-data on sea-level and its associated datasets are special in many regards. I wrote about this in my background post on the last paper, and several problems arise when these datasets are analysed. Since I structured these problems into three fields within the paper, I would like to do the same here.

The datasets and their basic interpretation are the most critical point, and the one where I expect the greatest steps forward in the next years. Some papers have come out recently that highlight particular problems, like the interpretation of coral datasets. We have to make progress in understanding combinations of mixed datasets, and this can only happen when future databases advance. This will be an interdisciplinary effort and therefore challenging for everyone involved.

The next field is the models. The analysis is currently done with simple models, which has its advantages and disadvantages. New developments are not expected immediately, so organising the development and sharing the results of the models will be the major issue in the near future. New ideas about the ice sheets and their simple modelling will also be needed for approaches similar to the one we used in this paper. Statistical modelling is fine up to a point, but it has shortcomings when it comes to the details.

The final field is the statistics. Handling sparse data with multidimensional, probably non-Gaussian uncertainties has proven to be complicated. New statistical methodologies are needed that are simple enough for every involved discipline to understand them, yet powerful enough to solve the problem. In our paper we did our best to develop and use a new methodology to achieve that, but there are certainly other approaches possible. So creativity is needed to generate methodologies that not only deliver a value for the various parameters of interest, but also good and honest uncertainty estimates.

Only when these three fields develop further can we really expect to advance our insights into the sea-level of the last interglacial. It is not a development that will happen quickly, but I am sure the possible results are worth the effort.


Background to “Palaeo sea-level and ice-sheet databases: problems, strategies and perspectives”

This post is about the new paper on databases, which came out this week by a group of authors including me. The title of the paper is “Palaeo sea-level and ice-sheet databases: problems, strategies and perspectives” and it was published in Climate of the Past. It is based on discussions and results of a PALSEA2 meeting in 2014 in Lochinver and was compiled by many leading scientists in this field.

The fundamental question this paper addresses is how to bring the information gathered in the field to those who use the data for their analyses. Databases are a medium for this, and many scientists create their own in many different ways. But making them reusable by other scientists in the best possible way is a huge task. The paper therefore gives some general guidelines for the workflow of database creation and the steps that should be taken in the future. It focusses on palaeo sea-level and ice-sheet data, as they are used to investigate the development of sea-level over the past thousands and millions of years.

This post is intended as an introduction to several other background posts, which I plan to write in the coming weeks on further details of the paper. The topics I intend to write about are

  1. What makes palaeo sea-level and ice-sheet data so special?
  2. What does ATTAC^3 mean for scientific data handling?
  3. Data publishing? Isn’t that a solved issue?
  4. Data citation and the problem with the H-index
  5. What can the future bring?

With these topics I hope to shed some more light on the details of this paper from my personal point of view.

Some comments on the Ocean Glider paper

To call it a new paper might be a little bit exaggerated, but its publication happened within the last year. It was submitted around a year ago and published online in November, but the actual publication of the paper happened in April. The name is quite long, but already tells you a lot about its content:

Turbulence and Mixing by Internal Waves in the Celtic Sea Determined from Ocean Glider Microstructure Measurements

I do not want to talk about the whole paper, as my personal contribution was tiny compared to the great work of the other authors. Still, I would like to write a little bit about my part in it and what my task was.

Ocean gliders are one of the relatively new tools that are currently revolutionising the oceanographic observation system. As such they are being tested for many applications, in the case of this article for microstructure measurements. My part started when the main work was already done. After all the measuring, processing and calculations, two time series spanning nearly nine days were given to me with the simple question: “What can you tell us about them?” Of course there were ideas around about what might be in them, but as I do statistics, it is my task to make such statements waterproof.

As always, you have to get familiar with the data before you can investigate detailed questions. I did quality assurance research during my PhD, and from that I have my standard tools to play around with data and to learn about it. One of these tools is the histogram test, which is a nice test for inhomogeneities within datasets. The first thing you find with it is that there are obvious cycles within the time series, so you ask the experts for the most obvious and physically most probable cycles you might find in them. Of course you can also determine exactly which cycles are present by performing a spectral analysis, but when you make decisions about simplifying and clustering data, it is better to understand the physics behind it. After doing this it was obvious that there are two different parts of the dataset (with different statistical properties), which at first view seem quite unrelated. The advice to look at the data in the logarithmic sense was then the main driver for the subsequent analysis.
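Since the spectral analysis and the histogram-based check are only described in words above, here is a rough Python sketch of what such a first look can involve. Everything in it is an assumption for illustration: the synthetic series, the sampling interval, the tidal-like period and the generic chi-square comparison of segment histograms are not the specific implementation used for the paper.

```python
# Illustrative sketch only: find dominant cycles in a time series and
# compare the histograms of two segments as a crude inhomogeneity check.
import numpy as np
from scipy import signal, stats

def dominant_cycles(x, dt, n_peaks=3):
    """Return the periods of the strongest spectral peaks (in units of dt)."""
    freqs, power = signal.periodogram(x, fs=1.0 / dt)
    order = np.argsort(power[1:])[::-1] + 1      # skip the zero frequency
    return 1.0 / freqs[order[:n_peaks]]

def histogram_shift(x1, x2, bins=20):
    """Chi-square distance between two segments' histograms (a generic
    check, not the specific histogram test mentioned in the post)."""
    edges = np.histogram_bin_edges(np.concatenate([x1, x2]), bins=bins)
    h1, _ = np.histogram(x1, bins=edges)
    h2, _ = np.histogram(x2, bins=edges)
    h2_scaled = h2 * (h1.sum() / h2.sum())       # match sample sizes
    keep = (h1 + h2_scaled) > 0
    return np.sum((h1[keep] - h2_scaled[keep]) ** 2 / (h1[keep] + h2_scaled[keep]))

# Synthetic stand-in for the glider series: ~nine days, 15-minute sampling,
# with a semidiurnal-like modulation (period of 12.42 h assumed here).
dt = 0.25                                        # hours between samples
t = np.arange(0, 9 * 24, dt)
series = np.exp(0.5 * np.sin(2 * np.pi * t / 12.42) + 0.2 * np.random.randn(t.size))

print("strongest periods [h]:", dominant_cycles(np.log(series), dt))
print("chi-square between halves:", histogram_shift(series[: t.size // 2], series[t.size // 2:]))
```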

When you assume a distribution for the data, it is important to test it. Having done this, it was simple to show that the time series are indeed, apart from the extremes, log-normally distributed. Performing the histogram test again, now on the logarithmic data, still showed the regime shift as before, so the interesting question became whether the two parts themselves were also log-normally distributed. Using QQ-plots it was simple to show that they were, and that only the mean and standard deviation in the logarithmic sense had changed. With that my part came to an end; it was written up as one section at the end of the paper and I was happy with it.
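A minimal sketch of such a log-normal check with QQ-plots is shown below, again on synthetic data standing in for the real time series; the split point, the distribution parameters and the variable names are purely illustrative assumptions.

```python
# Illustrative sketch: QQ-plots of log-transformed segments against a normal
# distribution; a roughly straight line indicates consistency with log-normality.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def qq_against_normal(x, ax, title):
    """QQ-plot of log(x) against a fitted normal distribution."""
    logx = np.log(x)
    stats.probplot(logx, dist="norm", plot=ax)
    ax.set_title(f"{title} (mean={logx.mean():.2f}, std={logx.std():.2f})")

# Two synthetic regimes that differ only in their log-space mean and spread.
rng = np.random.default_rng(1)
series = np.concatenate([
    rng.lognormal(mean=-9.0, sigma=1.0, size=400),   # quieter regime
    rng.lognormal(mean=-7.5, sigma=1.4, size=400),   # more energetic regime
])
split = 400

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
qq_against_normal(series[:split], axes[0], "segment 1")
qq_against_normal(series[split:], axes[1], "segment 2")
plt.tight_layout()
plt.show()
```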

So why are such analyses important? Why bring additional statistics into a paper that is already a solid one? Because these small, simple analyses contribute to the overall understanding of the data. Knowing the distribution of the data values and how it changes over time helps in modelling them or understanding the physics. Giving people simple tools to spot inhomogeneities would also allow real-time testing of the data and might open new ways of measuring it. And yes, it gives nice figures, which show the reader that there really is something within the data that might need further exploration. Statistics is not all about the equations; sometimes the right visualisation is equally important. All in all it was a nice example of how domain experts with their methodologies and a simple statistical analysis together give quick and solid results.


M.R. Palmer, G.R. Stephenson, M.E. Inall, C. Balfour, A. Düsterhus, J.A.M. Green (2015): Turbulence and mixing by internal waves in the Celtic Sea determined from ocean glider microstructure measurements. Journal of Marine Systems, 144, 57-69

The role of statistics in science

Traditionally, within the different disciplines of earth science the scientists are divided into two groups: modellers and observationalists. In this view the modellers are those who do theory, possibly with pen and paper alone, while the observationalists go into the field and get their hands dirty. That this view is a little bit outdated will not be news to anyone. In my opinion, it was really with the establishment of remote sensing that this division started to reunite (yes, reunite, because in the old days there were a lot of scientists who did everything). As a trained meteorologist, from my point of view this division hardly exists anymore. Both types of scientists sit in front of their computers, both are programming and both have to write papers with a lot of mathematical equations. In other fields the division might still be more obvious (e.g. geology), but for many it is only the type of data someone works with that classifies them as an observationalist or a modeller.

Let’s play: HadCRUT 4

Playing around with data can be quite fun and sometimes delivers interesting results. I did this a lot in the past, mainly out of necessity during my PhD. There I developed some methods for quality assurance of data, which of course needed some interesting applications. So every time a nice dataset became available, I ran it through my methods, and usually the results were quite boring. The main reason for this is that these methods are designed to identify inhomogeneities, and a lot of the published data nowadays is already quality controlled (homogenised), which makes it quite hard to identify new properties within a dataset. Model data in particular is often quite smooth, so it is necessary to look at quite old data to find something really interesting.

Data peer review paper background: Why is quality decisive information for data?

Using information on data quality is nothing new. A typical way to do it is via the uncertainty of the data, which in many data analysis methods gives the different data points something like a weighting. It is essential to know which data points or parts can be trusted and which are probably questionable. To help data reusers with this, a lot of datasets contain flags. They indicate when problems occurred during the measurement or when a quality control method raised doubts. Every scientist who analyses data has to take this information into account, and is keen to know whether it explains, for example, why some points do not fit into the analysis of the rest of the dataset.
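To make the idea of uncertainties and flags acting as a weighting concrete, here is a minimal sketch in Python. The inverse-variance weighting and the array names are a generic illustration of the principle, not a prescription from the paper.

```python
# Illustrative sketch: combine data points using their stated uncertainties
# as weights, while dropping points that carry a quality flag.
import numpy as np

def weighted_estimate(values, uncertainties, flags):
    """Inverse-variance weighted mean, ignoring points flagged as suspect."""
    good = ~flags                              # True where no problem was reported
    w = 1.0 / uncertainties[good] ** 2         # smaller uncertainty -> larger weight
    mean = np.sum(w * values[good]) / np.sum(w)
    sigma = np.sqrt(1.0 / np.sum(w))           # uncertainty of the weighted mean
    return mean, sigma

values = np.array([1.02, 0.98, 1.50, 1.01])
uncertainties = np.array([0.05, 0.04, 0.30, 0.06])
flags = np.array([False, False, True, False])  # third point failed quality control
print(weighted_estimate(values, uncertainties, flags))
```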

With the institutionalised publication of data, the estimation of data quality reaches a new level. The reason is that published data is not only used by scientists who are well aware of the specific field, but also by others. This interdisciplinary environment is a chance, but also a threat. The chances lie in new synergies, in new views being brought to a field and, even more, in the huge opportunities of new dataset combinations. In contrast, the risks are possible misunderstandings and misinterpretations of datasets and the belief that published datasets are ideal. These risks are best countered by a proper documentation of the datasets. The aim of a data peer review is therefore to guarantee the technical quality (like readability) of the dataset and a good documentation. This is even more important since the datasets themselves should not be changed at all.