The burden of maintaining an R package

During my PhD I worked on Quality Assurance of Environmental Data and on how to exchange quality information between scientists. I developed a concept for a possible workflow that would help all scientists, data creators and re-users alike, to make data publications much more useful. One major foundation of this was a set of quality tests, which I either took from the existing literature or developed anew.

Part of this work was the development of a proof-of-concept implementation of the methodology. I used R, which has been my primary language for quite a while, to design a test workflow that could be automated as much as possible. It was quite complex and, in retrospect, a bit too ambitious for real-world applications. Anyway, as I prefer open science, I published it as an extension package for R in 2011: qat – Quality Assurance Toolkit.

The publication process was more challenging than anticipated. For each function, and my package had more than a hundred, a detailed help file was required, which at the time took me quite a while to create. I also wanted to add further material, like an instruction manual, so that at least in theory the full functionality (like automatic plotting and saving of the test results) could be understood and used. Finally, when it was uploaded, I was happy and kept extending the package until my PhD project came to an end.
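To give an idea of what this documentation work involves, here is a minimal sketch of a documented test function. The function name and arguments are invented for illustration and are not part of qat; at the time I wrote the help pages by hand as Rd files, whereas today roxygen2 comments like the ones below are a common way to generate them.

```r
# Minimal sketch of a documented package function (illustrative only; not
# part of qat). The roxygen2 comments are turned into the Rd help file
# that CRAN requires for every exported function.

#' Static range test for a one-dimensional measurement vector
#'
#' Flags all values that fall outside a user-defined interval.
#'
#' @param x Numeric vector with the measurements to be tested.
#' @param lower Lower limit of the accepted range.
#' @param upper Upper limit of the accepted range.
#' @return A logical vector of the same length as \code{x}, with TRUE
#'   marking the values that fail the test.
#' @examples
#' qa_range_test(c(1, 5, 99), lower = 0, upper = 10)
#' @export
qa_range_test <- function(x, lower, upper) {
  x < lower | x > upper
}
```

Multiply such a page by more than a hundred functions and the effort behind the first upload becomes clear.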

Unfortunately, the work on the package did not stop there. R as a language is constantly changing, not so much in the day-to-day tools, but in the infrastructure behind the packages. New requirements come up now and then, usually together with a deadline for package maintainers. What is quite simple to solve for small packages can be a real challenge for complex ones like mine. I had to drop my instruction manual when the vignette system changed and created a dedicated website to keep it accessible. I also had to replace packages I depend on, which usually means quite a bit of change in the code.

All these changes are doable, but the big problems start with the requirement that a newly uploaded package has to fulfil the current norms for R packages. A package that was fine a few months earlier may have to change dramatically with the next update. This usually leads to a time problem, as each update then takes several days. So minor changes to the original code result in a heavy workload. This led to the situation that I was not able to update the package in time when the last deadline turned up, and so it was archived. Half a year later I found some time and have now brought it back onto the CRAN network.
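For readers who have not been through this process: before each upload the package has to pass the standard checks with the CRAN settings, without errors or warnings and ideally without notes. A rough sketch of one possible workflow, here using the devtools package, looks as follows; the exact commands, and how strict the checks are, change over time.

```r
# Sketch of a typical pre-submission workflow with devtools; CRAN itself
# runs R CMD check with its own settings, so this is only an approximation.
library(devtools)

document()                  # regenerate the help files
check(args = "--as-cran")   # run R CMD check with CRAN-like settings
build()                     # build the .tar.gz for the upload
```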

All in all, this workload is keeping me from creating new R packages. Making them would be feasible, but maintaining them is a pain. With these constant policy changes, R falls more and more out of favour with heavy users, and with that it is in danger of losing out to other languages like Python when it comes to teaching the next generation of scientists. My personal hope is that future development will lead to a more stable package policy within R, so that more packages will remain available in the future. As things stand, I am happy to have my package up again, but when the next deadline lands in my mailbox, I will again have to weigh the looming workload before I can afford to schedule a new release.

Massive ensemble paper background: What will the future bring?

In my final post on the background of the recently published paper, I would like to take a look into the future of this kind of research. Basically, it repeats what I have already written on different occasions, but putting it together in one post might make it clearer.

Palaeo-data on sea-level and the associated datasets are special in many regards. That is what I wrote in my background post on the last paper, and because of these special properties several problems occur when the datasets are analysed. As I structured the problems into three fields within the paper, I would like to do the same here.

The datasets and their basic interpretation are the most dramatic point, and the one where I expect the greatest steps forward in the coming years. Some papers have come out recently that highlight some of the problems, like the interpretation of coral datasets. We have to make progress in understanding how mixed datasets can be combined, and this can only happen when future databases advance. This will be an interdisciplinary effort, and therefore challenging for everyone involved.

The next field is the models. The analysis is currently done with simple models, which has its advantages and disadvantages. New developments are not expected immediately, so organising model development and sharing model results will be the major issues in the near future. New ideas about the ice sheets and their simple modelling will also be needed for approaches similar to the one we used in this paper. Statistical modelling is fine up to a point, but it has shortcomings when it comes to the details.

The final field is the statistics. Handling sparse data with multidimensional, probably non-Gaussian uncertainties has proven to be complicated. New statistical methodologies are needed that are simple enough for every involved discipline to understand, yet powerful enough to solve the problem. In our paper we did our best to develop and use a new methodology to achieve that, but different approaches are certainly possible. So creativity is needed to generate methodologies that deliver not only a value for the various parameters of interest, but also good and honest uncertainty estimates.

Only when these three fields develop further can we really expect to advance our insights into the sea-level of the last interglacial. This development will not happen quickly, but I am sure that the possible results are worth the effort.

Massive ensemble paper background: What can we say now on the LIG sea-level?

Now that the new paper is out, it is a good time to think about the current status of the main question it covered: the sea-level during the LIG. Usually I do not want to generalise too much in this field, as there is currently a lot going on; many papers are in preparation or have just been published, and the paper we have just published was originally submitted one and a half years ago. Nevertheless, some comments on the current status might be of interest.

The main question most papers on this topic cover is: how high was the global mean sea-level during the last interglacial? There have been various estimates in the past, but when you ask most people who work on this topic, they will answer more than six metres higher than today. That is of course an estimate with some uncertainty attached to it, and currently most expect that it was not much higher than about nine metres above today's level. There are several reasons for this estimate, but at least we can say that we are quite sure it was higher than present. From my understanding, geologists are quite certain that this is true at least for some regions, and even though the data are sparse, meaning the number of data points is low, it is very likely that this was also the case for the global mean. Whether it was 5, 6 or 10 metres higher is a more complicated question. It will still need more evaluation before we can make more certain statements.

Another question on this topic concerns the start point, end point and duration of the highstand. This question is very complex, as it depends on definitions and on the problem that in many places only the highest sea-level over the duration of the LIG can be measured. That makes it very hard to say something definitive, especially about the starting point. As such, our paper did not really make a statement on this; it just shows that data from boreholes and data from corals currently do not give the same answer.

The last question everybody asks concerns the variability of the sea-level during the LIG. Was it just one big rise and fall, or were there several phases with a glaciation phase in the middle? Or were there even more than two phases? Hard questions. The most reliable statements say that there were at least two phases, while from my perspective our paper shows that it is currently hard to make any statement based on the data we used. But here too, new data might give us the chance to make better statements.

So there are still many questions to answer in this field, and I hope the future, about which I will write in my last post on this topic, will bring many more insights.

Massive ensemble paper background: Data assimilation with massive ensembles

In the new paper we developed and modified a data assimilation scheme based on simple models and, up to a point, Bayesian statistics. In the last post I talked about the advantages and purposes of simple models, and this time I would like to talk about their application.

As already mentioned, we had a simple GIA model available, which was driven by a statistical ice-sheet history generation process. From the literature we had the guideline that past sea level roughly followed the dO18 curve, but that large deviations from it, in both variability and values, can be expected. As always in statistics, there are several ways to perform a task, based on different assumptions. To provide a contrast to the existing literature, we focused on an ensemble-based approach. Our main advantage here is that we end up with individual realisations of the model run and can show individually how each of them performs compared to the observations.
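To illustrate the basic idea (this is not the exact generation scheme of the paper), an ensemble member can be thought of as a dO18-based first guess of the global ice volume plus a smooth random anomaly. Everything in the sketch below, from the linear scaling to the anomaly model, is an assumption made purely for the example.

```r
# Illustrative only: build ensemble members as a dO18-based first guess
# plus smooth random anomalies. The scaling, the anomaly model and the
# synthetic dO18 record are invented for this sketch.
set.seed(1)

ages <- seq(0, 214000, by = 500)                                  # years before present
d18o <- sin(ages / 20000) * 0.8 + rnorm(length(ages), sd = 0.05)  # stand-in for a real dO18 record

# First guess: linear scaling of dO18 to ice volume (in sea-level equivalent metres)
first_guess <- -60 * d18o

# One ensemble member: the first guess plus a smoothed random-walk anomaly
make_member <- function(first_guess, sd_step = 2, k = 11) {
  anomaly <- cumsum(rnorm(length(first_guess), sd = sd_step))
  anomaly <- stats::filter(anomaly, rep(1 / k, k), sides = 2)     # moving-average smoothing
  anomaly[is.na(anomaly)] <- 0
  first_guess + anomaly
}

ensemble <- replicate(100, make_member(first_guess))              # one member per column
```

Each column of such an ensemble is one possible ice history, and the question of the next paragraphs is how to decide which of them agree with the observations.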

The first step in the design of the experiment was the question of how to compare a model run to the observations. As there were several restrictions from the observational side (limited observations, large two-dimensional uncertainties, etc.), we decided to combine Bayesian statistics with a sampling algorithm. The potentially large number of outliers also required us to modify the classical Bayesian approach. As a consequence, we were able at that point to estimate a probability for each realisation of a model run.
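As a rough illustration of such a comparison (not the exact formulation used in the paper), a model run can be scored against each observation with a heavy-tailed density, so that single outliers cannot dominate the result, while the dating uncertainty is handled by averaging over samples of the observation age. The Student-t choice and all names below are assumptions made for the example.

```r
# Illustrative scoring of one model run against sparse observations with
# uncertainties in both time and value; the Student-t density and the
# age sampling are example choices, not the paper's exact formulation.
score_run <- function(run_ages, run_rsl, obs_age, obs_age_sd, obs_rsl, obs_rsl_sd,
                      n_age_samples = 200, df = 3) {
  loglik <- 0
  for (i in seq_along(obs_rsl)) {
    ages_i <- rnorm(n_age_samples, obs_age[i], obs_age_sd[i])        # plausible ages
    pred_i <- approx(run_ages, run_rsl, xout = ages_i, rule = 2)$y   # model at those ages
    lik_i  <- mean(dt((obs_rsl[i] - pred_i) / obs_rsl_sd[i], df = df) / obs_rsl_sd[i])
    loglik <- loglik + log(lik_i)
  }
  loglik
}

# Example call with a toy run and three invented observations
run_ages <- seq(0, 214000, by = 500)
run_rsl  <- -60 * sin(run_ages / 20000)
score_run(run_ages, run_rsl,
          obs_age = c(120000, 125000, 130000), obs_age_sd = c(2000, 2000, 3000),
          obs_rsl = c(4, 6, 2),                obs_rsl_sd = c(1, 1.5, 2))
```

Each realisation then receives a weight proportional to the exponential of such a score, which is the quantity a resampling step can work with.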

The next part of the experimental design concerned a general strategy for creating the different ensemble members so that they are not completely random. Even with the capability to create a lot of runs, ensembles on the order of 10,000 realisations are not sufficient to determine a result without such a strategy. This led us to a modified form of a Sequential Importance Resampling Filter (SIRF). The SIRF uses a round-based approach. In each round a number of model realisations are calculated (in our case 100) and afterwards evaluated. A predefined number of them (we used 10), the best performers of the round, are carried forward to the next round and act as seeds for the new runs. As we wanted a time-wise determination of the sea-level, we chose the rounds along this dimension: every couple of years (more often in important time phases like the LIG) a new round was started. In each round the new ensemble members branched off from their seeds with anomaly time series for their future development. Our setup required that we always calculate and evaluate full model runs. To prevent very late observations from driving the whole analysis, we restricted the number of observations taken into account in each round. All these procedures led to a system in which, in every round, and with that at every time step of our analysis, the ensemble had the opportunity to choose new paths for the global ice sheets, deviating from the original dO18 curve.
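A heavily simplified, self-contained toy version of such a round-based filter is sketched below. The numbers of members and seeds follow the description above; the "model" and the scoring are deliberately trivial stand-ins, and the time-wise restriction of the observations per round is omitted, so this only shows the structure of the scheme, not the implementation used in the paper.

```r
# Toy round-based sequential importance resampling: branch members from
# seeds, score them against (synthetic) observations, keep the best as
# seeds for the next round. All data and models here are invented.
set.seed(42)

ages      <- seq(0, 214000, by = 500)
truth     <- -60 * sin(ages / 20000)                 # synthetic target history
obs_idx   <- sort(sample(seq_along(ages), 30))       # 30 sparse observation points
obs_value <- truth[obs_idx] + rnorm(30, sd = 5)

n_rounds  <- 20      # rounds of the filter
n_members <- 100     # realisations per round
n_seeds   <- 10      # best performers carried into the next round

branch_from_seed <- function(seed) {                 # seed plus a smooth anomaly
  anomaly <- stats::filter(cumsum(rnorm(length(seed), sd = 1)),
                           rep(1 / 11, 11), sides = 2, circular = TRUE)
  seed + as.numeric(anomaly)
}

score_member <- function(member) {                   # simple Gaussian misfit
  -sum((member[obs_idx] - obs_value)^2) / (2 * 5^2)
}

seeds <- list(rep(0, length(ages)))                  # flat first guess

for (r in seq_len(n_rounds)) {
  members <- lapply(seq_len(n_members), function(i) {
    branch_from_seed(seeds[[sample(length(seeds), 1)]])
  })
  scores <- vapply(members, score_member, numeric(1))
  seeds  <- members[order(scores, decreasing = TRUE)[seq_len(n_seeds)]]
}

best_history <- seeds[[1]]   # highest-scoring history after the final round
```

In the real setup, the Gaussian misfit would be replaced by the outlier-tolerant comparison sketched above, each round would correspond to a time step and would only see the observations assigned to it, and every member would be a full model run.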

As you can see, there were many steps involved, which made the scheme quite complicated. It also demonstrates that standard statistics reaches its limits here. Many assumptions, some simple and some tough, are required to generate a result. We tried to make these assumptions and our process as transparent as possible. Our individual realisations, based on different model parameters and assumptions about the dO18 curve, show that it is hard to constrain the sea-level for the LIG with the underlying datasets. Of course we get a best ice-sheet history under our conditions, that is how our scheme is designed, but it is always important to evaluate whether the results of our statistical analysis make sense (basically, whether the assumptions hold). In our case we could say that there is a problem. It is hard to say whether the model, the observations or the statistics itself contributes the largest part of it, but the observations are the prime candidate. The reasons are given in the paper, together with much more information and discussion of the procedure and assumptions.

Massive ensemble paper background: Massive ensembles: How to make use of simple models?

The new paper on the LIG sea-level investigation with massive ensembles relies on simple models. In this post I want to talk a bit about their importance and how they can be used in scientific research.

Simple models are models with reduced complexity. In contrast to complex models, their physics is simplified, they are tailored to a specific problem, and their results are not necessarily directly comparable to the real world. They can have a smaller, easier-to-maintain code base, but a simple model can also grow quickly in lines of code. What defines a simple model is the set of processes it includes, not the number of lines of code.

IMSC2016: Final day

The fifth and last day of the 13th International Meeting on Statistical Climatology (IMSC) has ended, and with it a great week here in the Rocky Mountains. The day started with the first homogenisation session, and the talks covered a wide range: the worldwide organisation of climate data generation, the proposal of a new homogenisation methodology and, finally, an overview of future challenges for homogenisation. As I worked on quality control of data during my PhD, this topic is of special interest to me, and I was happy to see such a variety of talks in this field.

(Photo: low clouds)

It was followed by a session on nonlinear methods. As it was the final day, the talks within the sessions covered a wider area, which was good. The day ended for me with another homogenisation session and, as before, the talks were of high quality.

As it was the last day, I would like to look back on the week. The weather was fantastic, apart from the last day, when the clouds and rain moved in. The conference and many of the talks were really interesting. The mixture of so many different topics gave a great overview of the many flavours of statistical application in climate science. Many scientists with different backgrounds, at various stages of their careers, made for a great exchange of knowledge and new views on the topics. It was really well organised, which made it easy to concentrate on the good parts of a conference. The meeting was well worth the visit, so perhaps I will be back in three years at the next IMSC.

Massive ensemble paper background: Sea-level in the LIG: What are the problems?

In the new paper on the LIG sea-level investigation with massive ensembles, I try to demonstrate how complicated it is to actually model the LIG sea level. There are many reasons for this, and they are certainly not unique to this specific problem, but apply to palaeoclimatology in general. So I would like to highlight a few specifics that I encountered in the preparation of this paper.

I have written before about the special nature of palaeo-sea-level data in general. From a statistical point of view, the available data are inhomogeneous, since they have different origins and are based on different measurement principles (e.g. analysis of data from corals or boreholes). Handling their two-dimensional uncertainties (time and value), which are usually also quite large, makes it complicated to apply standard statistical procedures. All too often it is assumed that at least one of the two dimensions is negligible, and when problems with non-normal uncertainty distributions are added, it poses a real challenge. It is also near certain that the dataset contains outliers. There is no clear way to identify them, so whether any value of a data point is valid remains unclear. Finally, we have to accept that there is hardly any real check to find out whether the outliers identified by a statistical method are just false measurements or reflect special features of the physical system. And of course, a huge problem is that the number of available data points is very low, which makes it even harder to constrain the sea-level at a specific point in time.

Another point of concern is the combination of two complex systems, which only together give a result comparable to the observations. On one side there are the ice sheets. It is hard to put physical constraints on their spatial and temporal development (especially in simple models). We tried it with assumptions about their relation to the current ice sheets (or better, those of the past few thousand years), but it is hard to tell how the ice sheets at that time really looked. Of course there are complex model studies on this, but what studies like ours need are consistent ice sheets at a relatively high temporal resolution (e.g. 500 years) over a very long time (more than 200,000 years). In addition, we would like to have several possible implementations of them (I used 39,000 runs, the majority of them with unique ice-sheet histories). That is quite a challenge, so statistical ice-sheet generation becomes a necessity.

The other complex model is the Earth. It reacts to the ice sheets and to their whole history. So the combination of these two models (the statistical ice-sheet model and the physical GIA model) is key to a successful experiment. We work here with simple models, which always have their benefits, but also their big disadvantages. I will talk more about that in the next post on this topic. But in this case these models are special. At least in theory they are non-Markovian, which means that not only the last state and the changes to the system since then play a role, but also system states much longer ago. Furthermore, future states also play a role, although they have a much smaller influence. This has a lot to do with the experimental setup, but it puts constraints on what you can do with your analysis procedures. It also requires that you analyse a very long period of development, in our case the last 214,000 years, even when you are only interested in what happened around 130,000 years before present.
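A toy example of this non-Markovian character (this is not the GIA model we used, just an illustration of the principle): if the deformation at a site is written as a convolution of the full ice-load history with a slowly decaying relaxation function, then the state at 130,000 years before present depends on loads from long before that, which is why the whole history has to be simulated.

```r
# Toy illustration of a non-Markovian response: the deformation at each
# time step is a weighted sum over the preceding load history, using an
# exponentially decaying relaxation kernel (all numbers are invented).
ages <- seq(214000, 0, by = -500)            # years before present, oldest first
load <- pmax(0, 60 * sin(ages / 20000))      # synthetic ice-load history

tau    <- 5000                               # illustrative relaxation time in years
lags   <- seq(0, 50000, by = 500)            # how far back the memory reaches
kernel <- exp(-lags / tau)
kernel <- kernel / sum(kernel)               # normalise the weights

deform <- stats::filter(load, kernel, sides = 1)   # NA where the history is still too short
```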

Another factor in this are the so-called delta 18O curves. We used them to create a first guess of the ice sheets, which we afterwards varied. Nevertheless, their connection to ice volume is complicated. It is still an open question whether this connection is stable over time or changes during interglacials compared to glacials. The simple assumption that it is constant makes the first guess complicated to handle, as it can be quite far off.

All of this poses challenges to the methodological and experimental design. Of course there are other constraints, like available computing time and storage, which force you to make choices. I will certainly talk about some of them in the post about massive ensemble data assimilation.

So what makes the LIG sea-level so complicated? It is the complexity of the problem and the small number of constraints, due to the sparsity and uncertainty of the data. This combination poses a huge challenge to everyone trying to shed light on this interesting research field. From the point of view of statistics, it is an interesting problem and a real test for any statistical data assimilation procedure available.