When you program in science, your projects usually evolve over time. Often you get an idea, create a quick-and-dirty solution and test it on data you know. This works for a while, but after several amendments, future-proofing and incorporating new ideas, the code becomes unmanageable. This is the point where bottom-up approaches break down and you start thinking about reprogramming everything. In these cases the new programs are no longer bottom-up: you have a clear goal in mind and often reuse some code snippets from before. We have reached the world of top-down.
During my PhD I worked on quality assurance of environmental data and on how to exchange quality information between scientists. I developed a concept for a possible workflow, which would help all scientists, both data creators and re-users, to make data publications much more useful. One major foundation of this work was a set of quality tests, which I either took from the existing literature or developed anew.
Part of this work was the development of a proof-of-concept implementation of the methodologies. I used R, which has been my primary language for quite a while, to design a test workflow that was as automatable as possible. It was quite complex and, in retrospect, a bit too ambitious for real-world applications. In any case, since I favour open science, I published it as an extension package for R in 2011: qat – Quality Assurance Toolkit.
The publication process was more challenging than anticipated. A detailed help file was required for each function, and my package had more than a hundred of them, which took me quite a while to create. I also wanted to add further information, such as an instruction manual, so that at least in theory the full functionality (like automatic plotting and saving of the test results) could be understood and used. Once it was finally uploaded, I was happy, and I kept extending the package until my PhD project came to an end.
Unfortunately, the work on the package did not stop there. R as a language is constantly changing, not so much in the day-to-day tools, but in the package infrastructure behind the scenes. New requirements come up now and then, usually associated with a deadline for package maintainers. What is quite simple to solve for small packages can be a real challenge for complex ones like mine. I had to eliminate my instruction manual when the vignette system changed, and I created a dedicated website to keep it accessible. I also had to replace packages I depend on, which usually entails quite a few changes in the code.
All these changes are doable, but the big problems start with the requirement that a newly uploaded package must fulfil the current norms for R packages. A package that was fine a few months earlier may have to change dramatically with the next update. This usually leads to a time problem, as each update then takes several days, so minor changes to the original code result in a heavy workload. As a consequence, I was not able to update my package in time when the last deadline came up, and it was moved to the archive. Half a year later I found some time and have now brought it back onto the CRAN network.
All in all, this workload keeps me from creating new R packages. Making them would be feasible, but maintaining them is a pain. With these constant policy changes, R becomes less and less attractive for heavy users, and with that it is in danger of losing out to other languages like Python in teaching the next generation of scientists. My personal hope is that future development will lead to a more stable package policy within R, so that more packages will remain available in the future. As things stand, I am happy to have my package up again, but when the next deadline lands in my mailbox, I will again have to weigh the looming workload before I can afford to schedule a new release.
Last week several journals published an agreement made at a National Institutes of Health (NIH) workshop in June 2014. It focuses on preclinical trials, but it invites a wider view on how the publication of research is developing in general. Furthermore, large journals like Science and Nature have accompanied this with further remarks on their view of the future of proper documentation of scientific research, which head in the direction I called “Open methods, open data, open models” a while ago. In this post I would like to comment on the agreement and on some reactions from these major journals.
For a few years now, data science has been a hot topic. Under the theme ‘Big Data’ it became popular, and if you believe some media, it will solve nearly every problem in the world. But what does it mean to be a data scientist? Is it a jack of all trades, or just someone who knows no field really well? As I would describe myself as a data scientist, I would like to write a little about how I see this field.
A few months ago I wrote about modularisation and the importance of programming in science. To get a little deeper into the topic, I will explain some further programming concepts. One of them, which is especially important in modularised programs, is the variable container. The basic idea is that all variables which are exchanged between modules are stored in a module of their own, which is then included in every other module. Nevertheless, some precautions have to be taken to make this concept a success in larger programs.
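To illustrate the idea, here is a minimal sketch in Python (the concept itself is language-agnostic; in a real project the container would live in its own file that every other module imports). All names in the sketch, such as SharedVariables and run_quality_test, are hypothetical:

```python
# In a real program this container would sit in its own module
# (e.g. variables.py) and be imported everywhere; here everything
# is kept in one file so the sketch is self-contained.

class SharedVariables:
    """Container for all variables exchanged between modules."""
    def __init__(self):
        self.temperature = None   # e.g. a measurement series, filled by a reader module
        self.flags = []           # quality flags produced by test modules

# One shared instance that every module imports and works on.
shared = SharedVariables()

def run_quality_test(container):
    """A hypothetical module function that reads from and writes to the container."""
    if container.temperature is None:
        container.flags.append("missing_data")
    return container.flags

run_quality_test(shared)
print(shared.flags)
```

Because every module works on the same container instead of passing long argument lists around, new variables can be added in one place; the precaution mentioned above is then mainly about documenting who reads and who writes each entry, so that modules do not overwrite each other's state.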