Showing posts with label data. Show all posts

Thursday, March 9, 2017

Data management for complete beginners

Bill Michener is a longtime advocate of data management and archiving practices for ecologists, and I was lucky to catch him giving a talk on the topic this week. It clarified for me the value of formalizing data management plans for institutions and lab groups, but also the gap between recommended best practices in data management and the reality in many labs.

Michener started his talk with two contrasting points. First, we are currently deluged by data. There is more data available to scientists now than ever, perhaps 45000 exabytes by 2020. On the other hand, scientific data is constantly lost. The longer it has been since a paper was published, the less likely it is that its data can be recovered (one study he cited showed that data had a half-life of 20 years). There are many causes of data loss, some technological, some due to changes in sharing and publishing norms. The rate at which data is lost may be declining, though. We're in the middle of a paradigm shift in how scientists see our data. Our vocabulary now includes concepts like 'open access', 'metadata', and 'data sharing'. Many related initiatives (e.g. GenBank, Dryad, GitHub, GBIF) are fairly familiar to most ecologists. Journal policies increasingly ask for data to be deposited into publicly available repositories, computer code is increasingly submitted during the review process, and many funding agencies now require statements about data management practices.
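To make the half-life figure concrete, here is a minimal sketch. It takes the 20-year number at face value and assumes data loss follows simple exponential decay, which is my assumption and not something stated in the talk; the function name `fraction_recoverable` is made up for illustration.

```python
def fraction_recoverable(years_since_publication: float,
                         half_life_years: float = 20.0) -> float:
    """Expected fraction of datasets still recoverable.

    Assumes simple exponential decay: the recoverable fraction halves
    every `half_life_years`. This is an illustrative assumption only.
    """
    return 0.5 ** (years_since_publication / half_life_years)


# Under these assumptions, half the data is gone after 20 years
# and three quarters after 40.
for years in (5, 10, 20, 40):
    share = fraction_recoverable(years)
    print(f"{years:>2} years after publication: {share:.0%} recoverable")
```

Even with this generous reading, roughly a sixth of datasets would already be unrecoverable five years after publication.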

This has produced huge changes in typical research workflows over the past 25 years. But data management practices have advanced so quickly that there’s a danger some researchers will begin to feel that good data management is unattainable, due to the level of time, expertise, or effort involved. I feel like data management is sometimes presented as a series of unfamiliar (and often changing) tools and platforms, and this can make it seem hard to opt in. It’s important to emphasize that good data management is possible without particular expertise, and in the absence of cutting-edge practices and tools. What I liked about Michener's talk is that it presented practices as modular ('if you do nothing else, do this') and as incremental. Further, I think the message was that this paradigm shift is really about moving from a mindset in which data management is done post hoc ('I have a bunch of data, what should I do with it?') to considering how to treat data from the beginning of the research process.

Hierarchy of data management needs.

Once you make it to 'Share and archive data', you can follow some of these great references.

Hart EM, Barmby P, LeBauer D, Michonneau F, Mount S, Mulrooney P, et al. (2016) Ten Simple Rules for Digital Data Storage. PLoS Comput Biol 12(10): e1005097. doi:10.1371/journal.pcbi.1005097

James A. Mills, et al. Archiving Primary Data: Solutions for Long-Term Studies, Trends in Ecology & Evolution, Volume 30, Issue 10, October 2015, Pages 581-589, ISSN 0169-5347.

https://software-carpentry.org//blog/2016/11/reproducibility-reading-list.html (lots of references on reproducibility)

K.A.S. Mislan, Jeffrey M. Heer, Ethan P. White, Elevating The Status of Code in Ecology, Trends in Ecology & Evolution, Volume 31, Issue 1, January 2016, Pages 4-7, ISSN 0169-5347.


Thanks to Matthias Grenié for discussion on this topic.

Thursday, August 13, 2015

#ESA100 The big-data era: ecological advances through data availability

Ecology is in a time of transition, from small-scale studies being the norm to large, global datasets being employed to test broad generalities. Along with this ‘big data’ trend comes a change in the ethical responsibility of scientists who receive public funds to share their data and ensure public access. As a result, big online data repositories have been popping up everywhere.

One thing that I have been doing while listening to talks, or talking with people, is making note of the use of large online databases. It is clear that the use of these types of data has become commonplace. So much so that in a number of talks, the speakers simply referred to them by acronym and we all understood what they had used. Here are examples of online data sources I heard referenced (and there are certainly many more):



It seems difficult to keep track of all the different sources of available data, and these repositories differ in their openness to public access, with some requiring registration, permission requests, or the inclusion of data submitters as authors on publications. With GenBank as the gold standard for a data repository, it is inevitable that other types of ecological data will soon be required to be freely available. I've never figured out why genetic data has different accessibility expectations than, say, leaf trait data.

Despite the attractiveness of huge amounts of data available online, such data can only paint broad pictures of patterns in nature and cannot capture small scale variability very well (Simberloff 2006). We still require detailed experiments and trait measurements at small scales for things like within-species trait variability.

Ecology has grown, and will continue to do so as data is made available. Yet, the classic ecological field experiment will continue to be the mainstay for ecological advancement into the future.



Simberloff, D. (2006) Rejoinder to Simberloff (2006): don't calculate effect sizes; study ecological effects. Ecology Letters, 9, 921-922.

Thursday, April 24, 2014

Data merging: are we moving forward or dealing with Frankenstein's monster?


I’m sitting in the Sydney airport waiting for my delayed flight, which gives me some time to ruminate about the mini-conference I am leaving. The conference, hosted by the Centre for Biodiversity Analysis (CBA) and CSIRO in Australia, on "Understanding biodiversity dynamics using diverse data sources", brought together several fascinating thinkers working on disparate areas including ecology, macroecology, evolution, genomics, and computer science. The goal of the conference was to see if merging different forms of data could lead to greater insights into biodiversity patterns and processes.

Happy integration

On the surface, it seems uncontroversial to say that bringing together different forms of data really does promote new insights into nature. However, this only really works if the data we combine meaningfully complement one another. When researchers bring together data, there are under-appreciated risks, and the effort can end up combining data that make weird bedfellows.

Weird bedfellows

The risks include data that are mismatched in their scale of observation, resulting in meaningful variation being missed. Data are often generated according to certain models with specific assumptions, and these data-generation steps can be misunderstood by end-users, resulting in inappropriate uses of data. Further, different data may be combined in standard statistical models when the linkages between data types are much more subtle and nuanced, requiring alternative models.
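As a toy illustration of the scale-mismatch risk, here is a minimal sketch in which fine-scale trait measurements are averaged up to grid cells so they can be joined to a coarse environmental layer. All names and numbers (`plot_measurements`, `cell_temperature_c`, `merge_to_cell_scale`) are hypothetical and invented for this example; nothing here comes from the conference.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical fine-scale trait data: (grid cell, leaf area in cm^2).
plot_measurements = [
    ("cell_A", 12.0), ("cell_A", 30.0), ("cell_A", 21.0),
    ("cell_B", 20.0), ("cell_B", 21.0), ("cell_B", 22.0),
]

# Hypothetical coarse-scale environmental layer: one value per grid cell.
cell_temperature_c = {"cell_A": 14.5, "cell_B": 18.0}


def merge_to_cell_scale(measurements, cell_values):
    """Average fine-scale measurements per cell and join the coarse layer.

    The within-cell standard deviation is kept alongside the mean so the
    variation discarded by the aggregation stays visible.
    """
    by_cell = defaultdict(list)
    for cell, value in measurements:
        by_cell[cell].append(value)

    merged = {}
    for cell, values in by_cell.items():
        merged[cell] = {
            "mean_trait": mean(values),
            "within_cell_sd": pstdev(values),
            "n": len(values),
            "temperature_c": cell_values[cell],
        }
    return merged


merged = merge_to_cell_scale(plot_measurements, cell_temperature_c)
for cell, row in merged.items():
    print(cell, row)
```

Both cells end up with the same mean trait value, so a model fitted to the merged table would treat them as identical, even though the measurements within cell_A vary far more than those within cell_B.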

These issues arise because researchers now have unprecedented access to numerous large data sets. Whether these are large trait data sets, spatial locations, spatial environmental data, genomes, or historical data, they are all built with specific underlying uses, limitations and assumptions.

Regardless of these concerns, the opportunity and power to address new questions is greatly enhanced by multiple types of data. One thing I gained from this meeting is that a new world of biodiversity analysis and understanding is emerging from smart people doing smart things with multiple types of data. We will soon live in a world where the data and analytical tools allow research to truly combine multiple processes to predict species' distributions, or to move from evolutionary events in deep history to modern-day ecological patterns.