The new paper shows that data publishing, as it has developed over the past years, has achieved a lot and built a solid foundation. But it also shows that some issues remain unresolved, especially when it comes to databases. In this post I would like to elaborate on this a bit and show how these issues could be addressed in the future.
There are many ways to publish datasets nowadays, ranging from the classical form of simply making data available on a webpage to the formal publishing of datasets. The latter is of course the better option, which I have explored in past papers. Coming from this it would be easy to assume that data publishing is a solved problem. But the new paper shows that there are still open threads when it comes to database publishing.
A database is a collection of many different data entries. It is not uniquely describable, nor does every entry share the same format. Moreover, when users cite a database, they often want to cite just a single entry. The DOI standard, which is currently used for data citation, falls short here: it defines no separator tag that would allow a dataset to be subdivided in a standardised form. It is therefore worth discussing whether new developments are needed. For example, one might want to cite only the entries 1 to 5 of a database and ignore all the others. Currently this is where it stops, because the DOI is not designed to cover this (it only allows citing the whole dataset). This may be right by design, but it will pose problems for huge databases, where a second identifier then has to be used to specify just the entries of interest. Of course, every hosting institution could introduce such measures for its own datasets, but as always, standardisation is the key to success, and for that the DOI would require further development.
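To make the idea of a separator tag concrete, here is a minimal sketch of what resolving such an identifier could look like. Everything about the scheme is an assumption: the `#` separator, the `entries` key, and the range syntax are not part of the DOI standard, which is exactly the gap discussed above.

```python
# Hypothetical scheme: a DOI followed by a fragment that selects entries,
# e.g. "10.1234/mydata#entries=1-5". Neither the "#" separator nor the
# "entries" key exists in the DOI standard; both are assumptions here.
def parse_subset_id(identifier: str):
    """Split a hypothetical identifier into (doi, entry range or None)."""
    if "#" not in identifier:
        return identifier, None  # plain DOI: cites the whole dataset
    doi, fragment = identifier.split("#", 1)
    key, _, value = fragment.partition("=")
    if key != "entries":
        raise ValueError(f"unknown fragment key: {key!r}")
    start, _, end = value.partition("-")
    return doi, range(int(start), int(end) + 1)

doi, entries = parse_subset_id("10.1234/mydata#entries=1-5")
# doi is the citable dataset identifier; entries selects 1..5 within it
```

The point of the sketch is that the subdivision logic is trivial once a separator is standardised; today each hosting institution would have to invent its own variant of it.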
Further problems occur when datasets change over time. The current standard requires that a new identifier be generated after each iteration. This demands proper version control and much care about which data existed at which time, and it becomes even more important in complex databases with different kinds of entries. How this can be done in a simpler way in the future, perhaps by automatically including a version in the citation identifier, might be a topic for future research. Again, each institution could do this separately, but do we really want that? Wouldn't it be better to introduce standards?
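The automatic inclusion of a version could be sketched roughly as follows. The `.v2`-style suffix and the `publish` helper are my own illustrative assumptions, not an existing standard or API; the sketch only shows the principle that each published revision becomes an immutable, separately citable snapshot.

```python
# Hypothetical: embed a version into the citation identifier so that each
# revision of a changing dataset gets its own citable string. The ".vN"
# suffix convention is an assumption, not part of the DOI standard.
history: dict[int, list] = {}  # version number -> frozen snapshot


def versioned_id(base_doi: str, version: int) -> str:
    """Build a hypothetical versioned identifier from a base DOI."""
    return f"{base_doi}.v{version}"


def publish(base_doi: str, data: list) -> str:
    """Record a new immutable snapshot and return its identifier."""
    version = len(history) + 1
    history[version] = list(data)  # copy, so the cited snapshot never changes
    return versioned_id(base_doi, version)


id1 = publish("10.1234/mydata", ["entry-a"])
id2 = publish("10.1234/mydata", ["entry-a", "entry-b"])
# id1 and id2 are distinct identifiers; each resolves to one fixed snapshot
```

The design point is that the version bookkeeping is mechanical and could be automated by the publishing infrastructure, rather than left to manual care at each institution.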
All in all, the infrastructure of data publishing is on a good track. Of course there is still the question of where data can be hosted under the requirements of long-term publishing, but this, like so much in science, depends on funding models. Some special points emerge for database publishing, such as the citation of databases, which I will discuss in more detail in another post.