Post-processing paper background: Do we need new approaches in verification?

In this final post of the background series I want to write about the necessity for new ideas in verification. Verification is essential in geo- and climate science, as it gives validity to our work of predicting the future, whether on short or long timescales. Especially in long-term prediction we face the huge challenge of verifying our predictions on a low number of cases. We are happy when we have 30+ events to identify our skill, but we have to find ways to make quality statements on potentially much smaller samples. When we, for example, investigate El Niño events over the satellite period, we might have a time series of fewer than 10 time steps at hand and reach a dead end with classical verification techniques. Contingency tables require many more cases, because otherwise the potential uncertainties become so large that they cannot be controlled. Correlation measures likewise depend on having many cases: anything below 30 is not really acceptable, which shows up in the quite high thresholds needed to reach significance. Still, most long-term prediction evaluation relies on such methods.
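To make the last point concrete, here is a minimal sketch (my own illustration, not taken from the paper) of the correlation a sample of n forecast-observation pairs has to exceed before it becomes significant at the usual 5% level:

```python
# Illustration: correlation threshold for two-sided significance (alpha = 0.05)
# as a function of the number of forecast-observation pairs n.
import numpy as np
from scipy import stats

def critical_correlation(n, alpha=0.05):
    """Smallest |r| that is significant in a two-sided t-test with n pairs."""
    t = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)
    return t / np.sqrt(n - 2 + t**2)

for n in (10, 20, 30, 50, 100):
    print(f"n = {n:3d}: |r| must exceed {critical_correlation(n):.2f}")
# n = 10 requires roughly |r| > 0.63, n = 30 still roughly |r| > 0.36
```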

An alternative idea has been proposed by DelSole and Tippett, which I first saw at the S2S2D conference in 2018. In this case we do not investigate a whole time series at once, as we would do for correlations, but single events. This allows us to evaluate the effect of every single time step on the verification and therefore gives new information beyond the information on the whole time series.
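A minimal sketch of the counting idea as I understand it (the function and the toy numbers are mine, not from DelSole and Tippett): for every time step we simply ask which of two forecasts is closer to the observation.

```python
# Event-wise comparison: count at how many time steps forecast A beats forecast B.
import numpy as np

def count_wins(forecast_a, forecast_b, obs):
    err_a = np.abs(np.asarray(forecast_a) - np.asarray(obs))
    err_b = np.abs(np.asarray(forecast_b) - np.asarray(obs))
    return int(np.sum(err_a < err_b)), len(obs)

wins, n = count_wins([0.1, 0.4, 0.8], [0.3, 0.2, 1.5], [0.0, 0.3, 1.0])
print(f"forecast A is closer to the observation in {wins} of {n} time steps")
```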

I have shown in the new paper that this approach also allows a paradigm shift in evaluating forecasts. Many previous approaches look at a situation where the evaluation of one year depends on the evaluation of other years; counting the successes of each single year instead makes a prediction evaluation much more valuable. We often do not ask how good a forecast is, but whether it is better than another forecast. And we want to know, at the time of forecasting, how likely it is that one forecast will be better than another. This information is not provided by many standard verification techniques, because they take into account the magnitude of the difference between two forecasts at each time step. That is certainly important information, but it limits our view on essential questions of our evaluation. It is quite possible that a single year decides whether one forecast is rated better than another. Or more extreme: with correlation, when one forecast is really bad in one year but better in all other years, it can still be dominated by the other forecast. These consequences have to be taken into account when we verify our models with these techniques.
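A toy example (numbers invented for illustration) shows how strongly a single year can dominate a correlation-based comparison, while the event-wise view tells a different story:

```python
# Forecast A is closer to the observation in 9 of 10 years, but one terrible
# year lets forecast B win clearly when the comparison is based on correlation.
import numpy as np
from scipy.stats import pearsonr

obs = np.arange(10.0)
fc_a = obs.copy()
fc_a[5] += 20.0                       # perfect except for one very bad year
fc_b = obs + np.tile([1.0, -1.0], 5)  # off by 1 in every single year

wins_a = int(np.sum(np.abs(fc_a - obs) < np.abs(fc_b - obs)))
print("A closer in", wins_a, "of 10 years")                 # 9 of 10
print("corr(A, obs) =", round(pearsonr(fc_a, obs)[0], 2))   # about 0.47
print("corr(B, obs) =", round(pearsonr(fc_b, obs)[0], 2))   # about 0.94
```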

As such, it is important to collect new ideas on how we want to verify, and how we want to quantify quality together with its uncertainties, for the new challenges posed to us. This new paper applies new approaches in several of these departments, but there is certainly quite some room for new ideas in this important field for the future.

Post-processing paper background: EMD and IQD? What is it about?

When you have two probability distributions and want to know the difference between them, you need a way to measure it. Over the years many metrics and distance measures have been developed and used; the most famous one is the Kullback-Leibler divergence. In a paper in 2012 I showed that a metric called the Earth Mover’s Distance (EMD) brings considerable improvements in detecting differences between distributions. So it was a natural idea for me to try to make use of this measure when we want to compare two distributions.
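In one dimension the EMD corresponds to the first Wasserstein distance, which is readily available in SciPy; a minimal sketch with made-up samples:

```python
# Distance between an ensemble forecast and samples of an uncertain observation.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
ensemble = rng.normal(loc=0.5, scale=1.0, size=30)        # e.g. 30 ensemble members
obs_samples = rng.normal(loc=0.0, scale=0.3, size=1000)   # uncertain observation

print("EMD =", round(wasserstein_distance(ensemble, obs_samples), 3))
```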

So what is given is a distribution from the model prediction, defined by the ensemble members, and an observation with a non-parametric distribution of its uncertainties. A nowadays standard tool for the evaluation of ensemble predictions is the CRPS, which compares the probability distribution predicted by the ensemble with a single deterministic observation. The paper now tries to make use of this tool and extends it to uncertain observations. So effectively, what is done is to measure the distance between two distributions, and by normalising it against a reference (e.g. the climatological state) a metric distinguishing between a good and a bad prediction can be created.
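A sketch of how such a normalisation could look; the exact formulation in the paper may differ, and the climatological reference here is just a made-up sample:

```python
# Hypothetical skill score built from a distribution distance: positive values
# mean the forecast is closer to the observation than the reference is.
import numpy as np
from scipy.stats import wasserstein_distance

def emd_skill_score(forecast, obs_samples, reference):
    d_forecast = wasserstein_distance(forecast, obs_samples)
    d_reference = wasserstein_distance(reference, obs_samples)
    return 1.0 - d_forecast / d_reference

rng = np.random.default_rng(1)
obs = rng.normal(1.0, 0.3, 500)    # uncertain observation
fc = rng.normal(0.8, 0.5, 30)      # ensemble forecast
clim = rng.normal(0.0, 1.0, 500)   # climatological reference
print("skill =", round(emd_skill_score(fc, obs, clim), 2))
```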

So how does the EMD work? Well, it effectively measures how much work would be needed to transform one distribution into the other. When you imagine a distribution as a sand pile, it measures the minimal amount of fuel a machine would need to push the sand around until it has created the target distribution. This picture is also the one from which the EMD got its name. As a metric it measures the distance precisely and therefore allows us to say, when we have two predictions, which one is closer to the observations.
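In one dimension the sand-pile picture has a simple counterpart: the EMD equals the area between the two cumulative distribution functions. A small sketch (my own illustration):

```python
# The area between the two empirical CDFs matches SciPy's Wasserstein distance.
import numpy as np
from scipy.stats import wasserstein_distance

def emd_from_cdfs(sample_a, sample_b, grid):
    cdf_a = np.searchsorted(np.sort(sample_a), grid, side="right") / len(sample_a)
    cdf_b = np.searchsorted(np.sort(sample_b), grid, side="right") / len(sample_b)
    return np.sum(np.abs(cdf_a - cdf_b)) * (grid[1] - grid[0])

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(1.0, 1.0, 200)
grid = np.linspace(-6.0, 7.0, 5001)
print(round(emd_from_cdfs(a, b, grid), 3), "vs", round(wasserstein_distance(a, b), 3))
```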

But it is important to mention here that there are problems with this view. Similar to the CRPS, there is literature describing that, despite their properties, measures like the EMD are potentially too kind to falsely sharp predictions compared to uninformed ones. In the CRPS the difference between the distributions is squared, so that a wrong but overconfident prediction is penalised more strongly. In my paper I also show the results with this squared approach as the IQD. A squared distance is much less intuitive than a linear one; it is harder to explain to scientists why they should use it over the alternatives, which leads to hesitant use of these kinds of measures. Therefore, it will be necessary in the future to describe much better why these issues occur and to develop new pictures that explain to everyone why squaring is the way to go. We also need new approaches to verification in general, but on that I will write more in the final post of this series.
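To make the contrast concrete, here is a sketch (with invented samples) in which the EMD integrates the absolute difference between the two CDFs while the IQD integrates the squared difference; the squared measure penalises a sharp-but-wrong ensemble far more heavily relative to a broad, uninformed one:

```python
import numpy as np

def cdf_on_grid(sample, grid):
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def emd_and_iqd(sample_a, sample_b, grid):
    diff = cdf_on_grid(sample_a, grid) - cdf_on_grid(sample_b, grid)
    dx = grid[1] - grid[0]
    return np.sum(np.abs(diff)) * dx, np.sum(diff**2) * dx

rng = np.random.default_rng(3)
obs = rng.normal(0.0, 0.3, 500)              # uncertain observation
sharp_but_wrong = rng.normal(2.0, 0.1, 30)   # confident but biased ensemble
broad = rng.normal(0.0, 2.0, 30)             # uninformed, very wide ensemble
grid = np.linspace(-8.0, 8.0, 4001)

for name, fc in (("sharp-but-wrong", sharp_but_wrong), ("broad", broad)):
    emd, iqd = emd_and_iqd(fc, obs, grid)
    print(f"{name:16s} EMD = {emd:.2f}  IQD = {iqd:.2f}")
```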

Post-processing paper background: Why do we need verification with uncertain observations?

Data verification is one of the cornerstones of geoscience. Without knowing whether a prediction has been correct, it is not possible to claim that we can predict anything at all. Most verification nowadays rests on the assumption that observations are perfect, often without acknowledging any uncertainties. Standard tools like contingency tables and correlations (the latter often used in some form in long-term predictions) make it hard to take these uncertainties into account (even when it is possible, e.g. by sampling strategies).

Another problem is that obtaining observational uncertainties to work with is often not an easy task. An example are reanalysis data, which for a long time were provided only in the form of a single realisation. This led to the problem that while predictions were often available as ensembles, the observations to compare them to were not. There are techniques available to use aggregated data and validate statistics of them, but the verification of most classical variables is still often done with certain observations. Currently the field is changing: reanalyses are starting to become available in the form of ensembles, so in the future we need new tools that make use of these developments.

But also on the philosophical side there is more need to look into verification with uncertain observations. We know that the real world is not deterministic, we know that our instruments are imperfect, and we are sure that these uncertainties matter. Why do we train our students in creating and measuring uncertainties, when we later do not use them in our analysis? And yes, there is the issue that all observations are at their core models. We acknowledge that models are imperfect, otherwise we would not need ensembles for creating predictions. But why do we then not take care of the uncertainties introduced by those models when we create observations? Those models are certainly not much better (they are just applied on a different temporal and spatial scale). So we have to confront this issue in every step we take; we do it in data assimilation, so we have to do it in data verification as well.

Therefore, new developments in this field are essential. We need new tools to look into uncertain observations and make use of them. This paper is a small step towards opening opportunities for future developments in this direction. It is certainly not a final solution and certainly not the first step; it is just another proposal for a tool to approach this challenge. In the future we require well understood and tested tools that are applicable by the broader scientific community. How those might look is currently open, as is whether the tools presented here are of any wider use. In the paper I described two metrics, the EMD and the IQD, and developed a strategy to build verification tools with them. In the next post I will take a deeper look into the two metrics and shine a light on the opportunities they offer.

Post-processing paper background: Why does sub-sampling work?

One main aspect of the new paper is the question why sub-sampling works. In many review rounds for the original paper (Dobrynin et al 2018) we got questions about a proper statistical model of the method and many claims why it should not work while it does (aka cheating). This is where this manuscript comes into play. Instead of selecting a (probably) random number of ensemble members close to one or more predictors, everything is transferred to probability distribution functions (pdfs). Of course those are not easily available without making a large number of assumptions, so I have gone the hard way. Bootstrapping of EOF fields is certainly no easy task in terms of computational cost, but it does work. It provides a pdf for every ensemble member and every predictor, as well as for the observations of the North Atlantic Oscillation (NAO).
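As a greatly simplified illustration of the bootstrapping idea (this is not the EOF bootstrap from the paper, and all numbers are made up): resample the values that enter an index and collect the resulting index values into a non-parametric pdf.

```python
# Bootstrap a non-parametric distribution for an area-mean index.
import numpy as np

def bootstrap_index(values, n_boot=1000, seed=0):
    """Return n_boot resampled means of `values`, i.e. samples of the index pdf."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    return values[idx].mean(axis=1)

# hypothetical anomalies entering the index
index_samples = bootstrap_index(np.random.default_rng(7).normal(0.4, 1.0, 150))
print("index =", round(index_samples.mean(), 2), "+/-", round(index_samples.std(), 2))
```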

Based on those pdfs it is now possible to look for the reason for the better prediction skill of the sub-sampling method compared to the no-sub-sampling case. The first step is to show that the distribution view and the sub-sampling are at least similar. In the end, making use of pdfs is not a pure selection but rather a weighting: it weights those ensemble members higher which are close to a predictor, compared to those far away. Of course there are differences between the two approaches, but the results are remarkably similar. This gave us more confidence that, in the many tests we did in the past on the sub-sampling methodology, the way we select does not have such a huge influence (but that will be explained in detail in an upcoming paper). Consequently, we can accept that when we can show how the pdf approach works, we will gain insights into the sub-sampling approach itself.
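A toy sketch of the weighting idea: the paper derives the weights from the bootstrapped pdfs, whereas this illustration simply uses a Gaussian kernel around an assumed predictor value, so take it only as a cartoon of the principle.

```python
# Weight ensemble members by their closeness to a predictor distribution.
import numpy as np

def member_weights(member_values, predictor_mean, predictor_std):
    z = (np.asarray(member_values) - predictor_mean) / predictor_std
    w = np.exp(-0.5 * z**2)       # members near the predictor get large weights
    return w / w.sum()

members = np.array([-1.8, -0.4, 0.1, 0.5, 0.9, 2.3])   # hypothetical NAO values
w = member_weights(members, predictor_mean=0.7, predictor_std=0.5)

w_mean = float(np.sum(w * members))
w_std = float(np.sqrt(np.sum(w * (members - w_mean) ** 2)))
print(f"weighted:   mean = {w_mean:.2f}, spread = {w_std:.2f}")
print(f"unweighted: mean = {members.mean():.2f}, spread = {members.std():.2f}")
```

In this toy setup the weighted ensemble is both shifted towards the predictor and considerably sharper, which is exactly the behaviour discussed in the next paragraph.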

The new paper shows that the key to understanding the mechanism is understanding the spread. While seasonal prediction has an acceptable correlation skill for its ensemble mean, each prediction of a single ensemble member is rubbish. In consequence, the overall ensemble has a huge spread of quite uninformed members. We have learned in the past to work with such problems, which requires us to take great care in how we evaluate predictions on the long-term timescale. Filtering this broad spread, and with it a highly variable distribution function, with informed and sharper predictor functions sharpens the combined prediction, while at the same time giving a better prediction overall. In other (simplified) words: we weight down the influence of those ensemble members that drifted away from the correct path and concentrate on those which are consistent with the overall state of the climate system.

As a consequence, the resulting prediction is in its properties quite similar to a statistical prediction, but it still has many advantages of a dynamical prediction. It is probably not the best of both worlds, but an acceptable compromise. To establish that, however, we need tools to evaluate the resulting predictions, and that proved to be harder than expected. But that is the story of the next post, on why we need verification tools for uncertain observations.

Post-processing paper background: What is sub-sampling?

The idea behind sub-sampling is that dynamical ensemble predictions on long-term timescales have too large a spread. To counter that, a couple of years ago we introduced a technique called sub-sampling (Dobrynin et al 2018), which combines statistical with dynamical predictions. To understand the post-processing paper and its intentions, it is key to understand at least the basics of the sub-sampling procedure, as the paper is in essence a generalisation of that methodology.

So how does it work? First of all we need a dynamical model which predicts our chosen phenomenon. It does not necessarily have to have skill on the chosen time frame, but that is something I will discuss when another paper currently in review is published. As we use the North Atlantic Oscillation (NAO) in winter in our papers, let’s take this as an example. In that case the NAO to be predicted is the NAO over December, January and February (DJF). The predictions are made at the beginning of November and show a reasonable prediction skill when measured with correlation measures, but they have a large spread. At that point we introduce the statistical prediction. For this we have to have physically motivated predictors. For example, the sea-surface temperature in parts of the North Atlantic in September or October is well connected to the NAO in DJF. Meaning: a high temperature in those areas in autumn will, with some probability, lead to a high NAO value in DJF, and the other way round. Consequently, when we choose those areas and take a normalised mean over their autumn SST values, we generate a predictor of the NAO in the following winter. The same is true for other variables, like sea ice or stratospheric temperature, where the literature has established the connections in the past. It is essential that we can trust these connections, as their validity matters when we want to trust the final predictions.
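A hypothetical sketch of how such a predictor could be built (the box, the data and the weighting are placeholders, not the configuration used in the paper): a cosine-latitude-weighted mean of autumn SST anomalies over a North Atlantic region, standardised over the available years.

```python
# Build a standardised area-mean SST predictor from anomalies in a lat-lon box.
import numpy as np

def sst_predictor(sst_anom, lats):
    """sst_anom: array (years, lat, lon) of autumn SST anomalies in the box."""
    weights = np.cos(np.deg2rad(lats))[None, :, None]
    weights = np.broadcast_to(weights, sst_anom.shape)
    area_mean = np.sum(sst_anom * weights, axis=(1, 2)) / np.sum(weights, axis=(1, 2))
    return (area_mean - area_mean.mean()) / area_mean.std()   # one value per year

lats = np.linspace(40.0, 60.0, 21)
fake_anomalies = np.random.default_rng(5).normal(size=(30, 21, 41))  # 30 years
predictor = sst_predictor(fake_anomalies, lats)
print(predictor.shape, round(float(predictor.std()), 2))   # (30,) and 1.0
```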

Having several predictor values for a prediction of the DJF-NAO now allows us to select those ensemble members of the dynamical prediction which are close to at least one of the predictors. In the published paper we chose, for each predictor, the 10 ensemble members closest to it, which leads to a minimum of 10 selected members (when all predictors select the same ensemble members) and at most all members of the 30-member ensemble (when the predictors deliver widely spread predictions). Taking the mean over those selected (or better, sub-selected) ensemble members has proven to have much more predictive skill than the ensemble mean over all members.
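A sketch of this selection step as described above (the member values and predictor values are invented): per predictor take the 10 closest members and then use the union of all selected members.

```python
# Sub-sampling: keep ensemble members close to at least one statistical predictor.
import numpy as np

def subsample(member_nao, predictor_nao, n_select=10):
    member_nao = np.asarray(member_nao)
    selected = set()
    for p in predictor_nao:
        closest = np.argsort(np.abs(member_nao - p))[:n_select]
        selected.update(closest.tolist())
    return sorted(selected)

rng = np.random.default_rng(11)
members = rng.normal(0.0, 1.0, 30)    # 30-member DJF-NAO forecast
predictors = [0.8, 0.5, 1.1]          # e.g. from SST, sea ice, stratosphere
idx = subsample(members, predictors)
print(len(idx), "members selected; sub-sampled mean =", round(float(members[idx].mean()), 2))
```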

An advantage of the approach is that we now have a better prediction not only for the NAO in DJF, but also for other variables. Because we are not only choosing NAO values with the help of the statistical predictors but full model fields, variables connected to the NAO in DJF also have the chance to be better predictable. All this allows us to make a better prediction for the chosen phenomenon (the DJF-NAO) as well as for the dynamical fields of different variables in selected areas.

As such it is a powerful tool and has proven, in other applications with different modifications, to be very stable. But several questions remained unanswered from the review processes. The most important one is why it works. Others have looked at the physical argumentation in recent years, while the new paper investigates the statistical argumentation. This will be explored further in the next background post.