For a few years now, data science has been a hot topic. It became popular under the theme 'Big Data', and if you believe some media, it will solve nearly every problem in the world. But what does it mean to be a data scientist? Is it a jack of all trades, or just someone who knows no field really well? Since I would describe myself as a data scientist, I would like to write a little bit about how I see this field.
To start, here is a video by John Rauser that shows his view of data science:
The most important statement therein is that a data scientist is someone who is well versed in both statistics and engineering. Personally, I would not put engineering or programming as the main second quality, but rather knowledge of the field the scientist is working in. In the Earth sciences, this is generally the application of physics in an Earth science context. This can be very field-specific (oceanography, geophysics or meteorology), but for many tasks it is usually enough to know one field well and to have learned the basics of the others. This does not mean that a data scientist is then able to solve all problems in the field, far from it, but that he is able to handle the existing datasets with general statistical tools and has the ability to interpret the results. Not as well as the expert, but well enough that the expert can make something out of it.
Programming is definitely a useful tool here, as it allows you not only to formulate solutions to problems as algorithms, but also to structure them. This is especially necessary because many fields have their own way of handling things, their own vocabulary and their own nomenclature.
So I would describe a data scientist as someone who can handle three things. First, the statistical background, because evaluating data without statistics is nowadays not really a way forward anymore (at least when you work in interdisciplinary environments). This becomes especially important in large-data applications. But working on underdetermined problems also needs this background, to estimate not only the best value but also how sure you can be about it. The second thing is basic knowledge of the field. Statistics is subjective: there are always several different methods that can be applied to the same problem. So it is important to know which method you use and which consequences this choice has for the interpretation of the result. But for this you have to know your system well, something you are not necessarily able to do when you are only fit in statistics. The third is the programmer. Many statistical tools rely on the application of good algorithms. This could be sampling strategies or Bayesian statistics; all of them require an efficient implementation to perform well. Being able to create this is a basic requirement for every data scientist.
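The point about estimating not only the best value but also how sure you can be about it can be sketched with a minimal bootstrap example. The measurements below are hypothetical, purely for illustration; the resampling idea is the part that matters:

```python
import random
import statistics

random.seed(42)

# Hypothetical repeated measurements of some quantity.
data = [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 10.2]

# Best estimate: the sample mean.
best = statistics.mean(data)

# Uncertainty: bootstrap the mean by resampling with replacement
# and looking at how much the resampled means scatter.
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(10_000)
]
spread = statistics.stdev(boot_means)

print(f"best estimate: {best:.2f} +/- {spread:.2f}")
```

The same reasoning carries over to more involved tools (Bayesian posteriors, Monte Carlo sampling): the output is never a single number, but a value together with a statement about its reliability.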
In general, data science does not really have a good reputation. The problem is that knowledge of the field in particular is a critical element, and it is often neglected. You can create and develop data science methods without it, but applying them correctly and interpreting the results in an acceptable manner is something completely different. So it definitely helps when you are able to understand the background of a problem. As the idea of data science should not be to replace an existing field, but to add to it and help those involved to better understand their research, it is a basic requirement to get used to the field's established philosophy and working style.
Data science will have a bright future, but it will also lead to many misunderstandings. Working in a proper scientific manner is elementary, and so is accepting that data do not tell you everything. The so-called gut feeling of scientists is an important indicator of good solutions to problems. Using this information within an analysis in a data science context helps to get new results, and to create new ways of thinking about them.