There is an elephant in the room at every conference in nearly every discipline. The elephant is so extraordinary that everyone seems to want to watch and hype it. In all this excitement, a lot of common sense gets lost, and above all the little mice creeping around the corners are overlooked.
The big topic is Big Data, the next big thing that will revolutionise society, at least if you believe the advertisements. The topic has grown over the past few years into something really big, especially as the opportunities behind this term are regularly demonstrated by social media companies. Funding agencies and governments have noticed this and put Big Data at the top of their science agendas. As a consequence, masses of scientists sit in conference sessions about Big Data, with discussions ranging from what it actually is to how it can be used. Nevertheless, there are a lot of traps in this field, which might have serious consequences for science in general.
First of all, let us start with what Big Data is. Various definitions are floating around nowadays; some are good, some are bad. In my personal view, Big Data is work on large (in number and/or in size) and often diverse datasets whose origin the analyst does not necessarily know. I know that is somewhat different from what you hear in many official definitions, but it matches the problem setting quite well. Starting from this, I want to discuss some of the problems that come with this way of working.
I would like to start with the last part: the analyst's ignorance of the source of the data. This is not at all uncommon in science, especially when we talk about interdisciplinary work. Scientists (ideally) know their field and their data, but when they are confronted with datasets from other fields, things become more critical. Furthermore, large amounts of data also pose problems for quality control. A few data points or time series can be checked in detail, but as soon as the amount rises, automatic methods have to be trusted. Sure, they are often better than humans at detecting problems, but that does not necessarily mean that all issues are detected. When we talk about fields varying in time, it gets even more problematic, as quality control for such data is generally not well developed at all. So there is a huge risk that the basis for the analysis is flawed, and with it the uncertainty of every interpretation of a statistical analysis rises.
When we talk about statistics, we know that different fields have different statistical traditions, shaped by the datasets usually available to them. Meteorologists are used to large datasets, but they are keen on quality control and standardisation of methodologies. You could say they have learned their lesson, but at the same time this limits their view of how to account for very diverse datasets. Yes, there are major developments in this area in the context of climatological analysis, but it still requires thought as to whether the standard methods are really applicable to non-homogeneous datasets. Another example is the social sciences. Traditionally they often had a low number of questionnaires or similar data. There especially, the huge amounts of data arriving since the internet revolution of the past decade seem like a dream for every scientist. But they are also a curse. Many statistical schemes and traditions are designed around a small number of data points. The traditional understanding of uncertainty comes with them, and these methods will tell you that simply taking more data points into account will most likely reduce your uncertainty. But what might be true for small, homogeneous datasets is not necessarily true for really large ones. These require the development and application of new statistics, and with them a new interpretation of the resulting uncertainties.
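The point about more data not automatically meaning less uncertainty can be sketched in a few lines. The following toy simulation (my own illustration, not from any real dataset) compares the error of a sample mean for homogeneous data against data carrying a hypothetical systematic offset of 0.1, as an uncorrected instrument bias might introduce: the random error shrinks with sample size, the systematic one does not.

```python
import random
import statistics

random.seed(42)

def sample_mean_error(n, bias=0.0):
    """Draw n points from a unit-variance distribution shifted by `bias`
    and return the absolute error of the sample mean relative to the
    true value 0."""
    data = [random.gauss(bias, 1.0) for _ in range(n)]
    return abs(statistics.mean(data))

# Homogeneous data: the error shrinks roughly like 1/sqrt(n).
small = statistics.mean(sample_mean_error(100) for _ in range(200))
large = statistics.mean(sample_mean_error(10_000) for _ in range(200))

# Heterogeneous data with a systematic offset of 0.1: no amount of
# extra points removes it, so the error stalls near the bias.
biased = statistics.mean(sample_mean_error(10_000, bias=0.1) for _ in range(200))

print(f"error, n=100:             {small:.3f}")
print(f"error, n=10000:           {large:.3f}")
print(f"error, n=10000 with bias: {biased:.3f}")
```

A classical small-sample intuition would report the third case as extremely precise, because the formulas only see the random scatter; the systematic part is exactly what the traditional uncertainty measures miss.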
Furthermore, in many analyses several different datasets are compared, and conclusions are drawn about their connection. But just because two datasets are correlated does not mean they have a causal relationship. You can account for this in your statistics, but above all you should know your data and the physical system they describe. This can be very complicated in Big Data, and there is therefore a risk of over-interpreting the results.
This is also true when we talk about verification and validation of results. Both require that the datasets used to create a result and those used for the analysis be independent of each other. What is controllable in small datasets hardly ever works with very large datasets from various sources. Theoretically you have to account for this, but an analyst who does not know the data well does not know how. Again, this leads to misinterpreted uncertainties and a false sense of assurance where there is none.
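How badly non-independence can inflate confidence is easy to demonstrate with a toy experiment. Below, a simple nearest-neighbour classifier (my own minimal sketch, not any particular library's method) is evaluated on pure noise: when the validation points also sit in the training data, the score looks perfect; on genuinely independent data it collapses to chance level.

```python
import random

random.seed(1)

# Pure-noise "dataset": the features carry no information about the
# label, so honest accuracy on independent data should be near 50%.
data = [([random.random() for _ in range(5)], random.randint(0, 1))
        for _ in range(200)]

def nearest_label(x, train):
    """Predict the label of the closest training point (1-nearest-neighbour)."""
    closest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return closest[1]

def accuracy(train, test):
    return sum(nearest_label(x, train) == y for x, y in test) / len(test)

# Leaky evaluation: every test point is also a training point, so the
# classifier just retrieves itself and "validates" its own memory.
leaky = accuracy(data, data)

# Honest evaluation: disjoint train/test split.
honest = accuracy(data[:100], data[100:])

print(f"leaky accuracy:  {leaky:.2f}")
print(f"honest accuracy: {honest:.2f}")
```

With many overlapping sources, the overlap is rarely this obvious, but the mechanism is the same: shared information between the datasets masquerades as predictive skill.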
Another point is more philosophical. In the past century, the physical sciences in particular worked under a paradigm based on forming hypotheses and testing them. Its philosophical background was given, for example, by Popper, and it replaced the alternative of simply looking at large amounts of observations, detecting patterns, and drawing conclusions from them. The latter was essentially described by Bacon in the 17th century. Both views still exist and are followed, with different weight in different scientific communities. Big Data tends by design to favour Bacon's viewpoint, which brings risks. Just because we have a large amount of data does not mean we no longer need to work cleanly with the scientific method, and in many fields, working from a hypothesis to a statistical answer by understanding the system behind the data is essential.
After all these critical words, I have to acknowledge that there are of course huge opportunities in the application of Big Data. Nevertheless, doing the analysis right requires many safeguards. First of all, it requires an interdisciplinary approach. You need someone who knows the data very well, and if you as the analyst or statistician cannot judge them yourself, someone has to explain them to you. This of course requires an interdisciplinary working relationship, which always has its own problems (vocabulary, approach to uncertainty, etc.). Furthermore, it requires very clean working, ideally starting from the creation of the datasets and extending to the final analysis. At every step the uncertainties have to be estimated and carried over to the next level.
As a consequence, the data has to be homogeneously quality-assured, which requires great databases. Often people only see that a lot of data is available, assume you merely have to pay someone to analyse it, and expect cheap results. But this is not the case. Big Data requires Big Databases. And by big I do not necessarily mean size: I mean well controlled, designed with great effort, and constantly maintained. The latter requires long-term funding, which is always a huge problem in science. But without good databases, and therefore good data, every analysis will fail in terms of accuracy. It might assure you of great results, but as seen above, this assurance might be just an illusion.
All in all, Big Data is a great opportunity when it is done right. Nevertheless, I fear that many underestimate the complexity of the problems that arise, and that because of this the high expectations might be disappointed in the long run. Yes, it is definitely worth trying, but in many fields the basic work needs to be done first: standardisation of methods and vocabulary, creation of good databases, and openness to new interpretations of statistics and the resulting uncertainties. With this, Big Data can really be the next big thing.