Thursday, March 9, 2017

Data management for complete beginners

Bill Michener is a longtime advocate of data management and archiving practices for ecologists, and I was lucky to catch him giving a talk on the topic this week. It clarified for me the value of formalizing data management plans for institutions and lab groups, but also the gap between recommended best practices in data management and the reality in many labs.

Michener started his talk with two contrasting points. First, we are currently deluged by data: there is more data available to scientists now than ever, perhaps 45,000 exabytes by 2020. On the other hand, scientific data is constantly being lost. The longer it has been since a paper was published, the less likely its data can be recovered (one study he cited showed that data had a half-life of 20 years). There are many causes of data loss, some technological, some due to changes in sharing and publishing norms. The rate at which data is lost may be declining, though. We're in the middle of a paradigm shift in how scientists see our data. Our vocabulary now includes concepts like 'open access', 'metadata', and 'data sharing'. Many related initiatives (e.g. GenBank, Dryad, GitHub, GBIF) are fairly familiar to most ecologists. Journal policies increasingly ask for data to be deposited in publicly available repositories, computer code is increasingly submitted during the review process, and many funding agencies now require statements about data management practices.
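As an aside, the half-life framing implies a simple exponential decay in data availability. Here is a minimal sketch of what that implies, assuming (purely for illustration) a constant 20-year half-life and no other sources of loss or recovery:

```python
def fraction_recoverable(years_since_publication, half_life=20.0):
    """Fraction of datasets still recoverable under simple exponential decay.

    Illustrative only: assumes a constant 20-year half-life (the figure cited
    in the talk) and no other sources of loss or recovery.
    """
    return 0.5 ** (years_since_publication / half_life)

for t in (5, 10, 20, 40):
    print(f"{t:>2} years after publication: "
          f"~{fraction_recoverable(t):.0%} of datasets recoverable")
```

Under these assumptions, roughly 84% of datasets survive 5 years, 50% survive 20 years, and only a quarter survive 40 years.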

This has produced huge changes in typical research workflows over the past 25 years. But data management practices have advanced so quickly that there is a danger some researchers will begin to feel good data management is unattainable, given the time, expertise, or effort involved. I feel that data management is sometimes presented as a series of unfamiliar (and often changing) tools and platforms, and this can make it seem hard to opt in. It's important to emphasize that good data management is possible without particular expertise, and in the absence of cutting-edge practices and tools. What I liked about Michener's talk is that it presented practices as modular ('if you do nothing else, do this') and as incremental. Further, I think the message was that this paradigm shift is really about moving from a mindset in which data management is done post hoc ('I have a bunch of data, what should I do with it?') to considering how to treat data from the beginning of the research process.

Hierarchy of data management needs.

Once you make it to 'Share and archive data', you can follow some of these great references.

Hart EM, Barmby P, LeBauer D, Michonneau F, Mount S, Mulrooney P, et al. (2016) Ten Simple Rules for Digital Data Storage. PLoS Comput Biol 12(10): e1005097. doi:10.1371/journal.pcbi.1005097

James A. Mills, et al. Archiving Primary Data: Solutions for Long-Term Studies, Trends in Ecology & Evolution, Volume 30, Issue 10, October 2015, Pages 581-589, ISSN 0169-5347.

https://software-carpentry.org//blog/2016/11/reproducibility-reading-list.html (lots of references on reproducibility)

K.A.S. Mislan, Jeffrey M. Heer, Ethan P. White, Elevating The Status of Code in Ecology, Trends in Ecology & Evolution, Volume 31, Issue 1, January 2016, Pages 4-7, ISSN 0169-5347.


Thanks to Matthias Grenié for discussion on this topic.

Monday, February 27, 2017

Archiving the genomes of all species

There is so much bad news about global biodiversity that it is nice to hear about new undertakings and approaches. One of these is the 'Earth BioGenome Project', which proposes to sequence the genomes of the entirety of life on Earth. Given that sequencing services have never been more affordable or more available to scientists, this is, though ambitious, a feasible undertaking. Still, with perhaps 9 million eukaryotes on the planet, a rough prediction suggests it could take 10 years and several billion dollars to achieve.
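To see where estimates of that magnitude come from, here is a back-of-envelope calculation; the per-genome cost and sequencing throughput below are purely illustrative assumptions, not figures from the project itself:

```python
# Back-of-envelope estimate; all inputs are illustrative assumptions.
n_species = 9_000_000        # rough count of eukaryotic species
cost_per_genome = 500        # assumed all-in cost (USD) per reference genome
genomes_per_year = 900_000   # assumed global sequencing throughput

total_cost = n_species * cost_per_genome
total_years = n_species / genomes_per_year

print(f"Total cost: ~${total_cost / 1e9:.1f} billion")  # ~$4.5 billion
print(f"Time required: ~{total_years:.0f} years")        # ~10 years
```

With those made-up inputs the answer lands in the same ballpark as the published prediction: a decade of work and several billion dollars.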

The cost suggests a certain agony of choice - what is the best use of that amount of money (in the dream world where money can be freely moved between projects)? Direct application to conservation and management activities, or a catalog of diversity which may be the only way to save some of these species? 
Leonard Eisenberg's tree of life (https://www.evogeneao.com).

Friday, February 3, 2017

When is the same trait not the same?

Different clades and traits yield similar grassland functional responses. 2016. Elisabeth J. Forrestel, Michael J. Donoghue, Erika J. Edwards, Walter Jetz, Justin C. O. du Toit, and Melinda D. Smith. PNAS vol. 114 no. 4, 705–710, doi: 10.1073/pnas.1612909114

A potential benefit of trait-centric approaches is that they may provide a path to generality in community ecology. Functional traits affect growth, reproduction, and survival, and so--indirectly--should determine an organism's fitness; differences in functional traits may delineate niche differences. Since fitness is dependent on the environment, it is generally predicted that there should be strong and consistent trait–environment relationships. Species with drought-tolerant traits will be most dominant in low precipitation regions, etc, etc. Since productivity should also relate to fitness, there should be strong and consistent trait–ecosystem functioning relationships.

There are also quite general descriptions of species traits and the life histories they confer (e.g. the leaf economic spectrum), implying again that traits can yield general predictions about an organism's ecology. Still, as McIntyre et al. (1999) pointed out, "A significant advance in functional trait analysis could be achieved if individual studies provide explicit descriptions of their evolutionary and ecological context from a global perspective."

A new(ish) paper does a good job of illustrating this need. In Forrestel et al., the authors compare functional trait values across two different grassland systems that share very similar environmental gradients and grass families, but entirely different geological and evolutionary histories. The North American and South African grasslands have similar growing season temperatures and the same precipitation gradient, allowing comparison between regions. They differ in grass species richness (62 grass species in SA versus 35 in NA) and species identity (no overlapping species), but contain the same major lineages (figure below).
From Forrestel et al. Phylogenetic turnover for major lineages along a precipitation gradient differed between the two regions.
Mean annual precipitation (MAP) is well established as an important selective factor, and many studies show relationships between community trait values and MAP. The authors measured a long list of relevant traits, and also determined the aboveground net primary productivity (ANPP) for sites in each grassland. When they calculated the community weighted mean (CWM) of traits along the precipitation gradient, region was a significant covariate for 6 of the 11 traits measured (figure below). The context (region) determined the response of those traits to precipitation.
From Forrestel et al.
Further, different sets of traits were the best predictors of ANPP in NA versus SA: in SA, specific leaf area and stomatal pore index were the best predictors, while in NA, height and leaf area were. The upside was that for both regions, models of ANPP explained a reasonable amount of variation (48% for SA, 60% for NA).
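For readers unfamiliar with the community weighted mean used here, a minimal sketch of the standard calculation looks like the following (the cover and trait values are made up for illustration, not data from the paper):

```python
def community_weighted_mean(abundances, trait_values):
    """Community weighted mean (CWM): trait values weighted by relative abundance."""
    total = sum(abundances)
    return sum(a / total * t for a, t in zip(abundances, trait_values))

# Hypothetical community: three grass species with made-up cover values and
# specific leaf area (SLA, mm^2/mg) -- not values from Forrestel et al.
cover = [50, 30, 20]
sla = [15.2, 22.8, 9.4]

print(f"CWM of SLA: {community_weighted_mean(cover, sla):.1f} mm^2/mg")
```

The CWM is then the community-level trait value that gets regressed against environmental variables such as MAP, or used to predict ANPP.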

It's an important message: plant traits matter, but how they matter is not necessarily straightforward or general without further context. As the authors note, "Instead, even within a single grass clade, there are multiple evolutionary trajectories that can lead to alternative functional syndromes under a given precipitation regime."

Tuesday, January 24, 2017

The removal of the predatory journal list means the loss of necessary information for scholars.

We at EEB & Flow periodically post about trends and issues in scholarly publishing, and one issue that we keep coming back to is the existence of predatory Open Access journals. These are journals that abuse a valid publishing model to make a quick buck, operating with clearly substandard practices meant to subvert the normal scholarly publishing pipeline (for example, see: here, here and here). In identifying those journals that, through their publishing model and activities, are predatory, we have relied heavily on Beall's list of predatory journals. This list was created by Jeffrey Beall, with the goal of providing scholars with the information needed to make informed decisions about which journals to publish in, and to avoid those that likely take advantage of authors.

As of a few days ago, the predatory journal list has been taken down and is no longer available online. Rumour has it that Jeffrey Beall removed the list in response to threats of lawsuits. This is really unfortunate, and I hope that someone who is dedicated to scholarly publishing will assume the mantle.

However, for those who still wish to consult the list, an archive of it still exists online - found here.

Friday, January 20, 2017

True, False, or Neither? Hypothesis testing in ecology.

How science is done is the outcome of many things, from training (both institutional and lab-specific), reviewers' critiques and requests, historical practices, subdiscipline culture and paradigms, to practicalities such as time, money, and trends in grant awards. 'Ecology' is the emergent property of thousands of people pursuing paths driven by their own combination of these and other motivators. Not surprisingly, the path of ecology sways and stalls, and in response papers pop up continuing the decades-old discussion about philosophy and best practices for ecological research.

A new paper from Betini et al. in Royal Society Open Science contributes to this discussion by asking why ecologists don't test multiple competing hypotheses (allowing efficient falsification or "strong inference" a la Popper). Ecologists rarely do so: Betini et al. found that only 21 of 100 randomly selected papers tested 2 hypotheses, and only 8 tested more than 2. Multiple hypothesis testing is a key component of strong inference, and the authors hearken back to Platt's 1964 paper "Strong Inference" to argue that ecologists should adopt it.
Platt
From Platt: “Science is now an everyday business. Equipment, calculations, lectures become ends in themselves. How many of us write down our alternatives and crucial experiments every day, focusing on the exclusion of a hypothesis? We may write our scientific papers so that it looks as if we had steps 1, 2, and 3 in mind all along. But in between, we do busywork. We become "method-oriented" rather than "problem-oriented." We say we prefer to "feel our way" toward generalizations.”
[An aside to say that Platt was a brutally honest critic of the state of science and his grumpy complaints would not be out of place today. This makes reading his 1964 paper especially fun. E.g. “We can see from the external symptoms that there is something scientifically wrong. The Frozen Method. The Eternal Surveyor. The Never Finished. The Great Man With a Single Hypothesis. The Little Club of Dependents. The Vendetta. The All-Encompassing Theory Which Can Never Be Falsified.”]
Betini et al. list a number of common intellectual and practical biases that likely prevent researchers from using multiple hypothesis testing and strong inference. These range from confirmation bias and pattern-seeking to the fallacy of factorial design (which leads to unreasonably high replication requirements, including for uninformative combinations of treatments). But the authors are surprisingly unquestioning about the utility of strong inference and multiple hypothesis testing for ecology. For example, Brian McGill has a great post highlighting the importance and difficulties of multi-causality in ecology - many non-trivial processes drive ecological systems (see also).
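The factorial-design point is easy to see with a little arithmetic: the number of treatment combinations grows multiplicatively with each added factor, so the replication burden balloons even for modest designs. A quick sketch, with entirely hypothetical factors and replication levels:

```python
# Hypothetical fully factorial design: replication needs grow multiplicatively.
factors = {"nitrogen": 2, "water": 3, "temperature": 3, "competitor removal": 2}
replicates_per_combination = 10

n_combinations = 1
for levels in factors.values():
    n_combinations *= levels

print(f"Treatment combinations: {n_combinations}")                                   # 36
print(f"Experimental units needed: {n_combinations * replicates_per_combination}")   # 360
```

Four factors with two or three levels each already demand hundreds of experimental units, many of them combinations with little theoretical interest.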

Another salient point is that falsification of hypotheses, which is central to strong inference, is especially unserviceable in ecology. There are many reasons that an experimental result could be negative and yet not result in falsification of a hypothesis. Data may be faulty in many ways outside of our control, due to inappropriate scales of analyses, or because of limitations of human perception and technology. The data may be incomplete (for example, from a community that has not reached equilibrium); it may rely inappropriately on proxies, or there could be key variables that are difficult to control (see John A. Wiens' chapter for details). Even in highly controlled microcosms, variation arises and failures occur that are 'inexplicable' given our current ability to perceive and control the system.

Or the data might be accurate but there are statistical issues to be concerned about, given many effect sizes are small and replication can be difficult or limited. Other statistical issues can also make falsification questionable – for example, the use of p-values as the ‘falsify/don’t falsify’ determinant, or the confounding of AIC model selection with true multiple hypothesis testing.
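On the AIC point: information criteria rank the relative support for a set of candidate models, rather than falsifying any of them, so the 'best' model can still be a poor one. A minimal sketch of that distinction, using hypothetical model names and log-likelihoods:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*lnL. Lower is better -- but only relatively."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods and parameter counts for two candidate models
# fit to the same data.
candidates = {"precipitation only": (-120.4, 2), "precipitation + soil N": (-118.9, 3)}

aics = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(aics, key=aics.get)
for name, value in aics.items():
    print(f"{name}: AIC = {value:.1f} (delta AIC = {value - aics[best]:.1f})")

# The lowest AIC identifies the best-supported model *among those considered*;
# it does not falsify the alternatives, and even the 'best' model may fit poorly.
```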

Instead, I think it can be argued that ecologists have relied more on verification – accumulating multiple results supporting a hypothesis. This is slower, logically weaker, and undoubtedly results in mistakes too. Verification is most convincing when effect sizes are large – e.g. David Schindler's Lake 226, which provided a single, decisive example of phosphorus supplementation causing eutrophication. Unfortunately, small effect sizes are common in ecology. There also isn't a clear process for dealing with negative results when a field has relied on verification - how much negative evidence is required to remove a hypothesis from use, versus merely leading to caveats or modifications?

Perhaps one reason Bayesian methods are so attractive to many ecologists is that they reflect the modified approach we already use - developing priors based on our assessment of evidence in the literature, particularly verifications but also evidence that falsifies (for a better discussion of this mixed approach, see Andrew Gelman's writing). This is exactly where Betini et al.'s paper is especially relevant – intellectual biases and practical limitations matter even more outside the strict rules of strong inference. It seems important for ecologists to address these biases as much as possible: we need better training in philosophical, ethical, and methodological practices; priors, which are frequently amorphous and internal, should be externalized using meta-analyses and reviews that express the state of knowledge in an unbiased fashion; and we should strive to formulate hypotheses that are specific and to identify their implicit assumptions.
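As a concrete (and deliberately simplified) illustration of externalizing accumulated evidence as a prior, here is a beta-binomial update; the counts of 'supporting' and 'non-supporting' results are entirely hypothetical stand-ins for what a meta-analysis might provide:

```python
# Beta-binomial update: a toy illustration of turning past evidence into a prior.
# All numbers are hypothetical.
prior_support, prior_against = 8, 2   # e.g. results tallied in a literature review
new_support, new_against = 3, 4       # outcomes of new studies

posterior_support = prior_support + new_support
posterior_against = prior_against + new_against

prior_mean = prior_support / (prior_support + prior_against)
posterior_mean = posterior_support / (posterior_support + posterior_against)

print(f"Prior mean support rate:     {prior_mean:.2f}")      # 0.80
print(f"Posterior mean support rate: {posterior_mean:.2f}")  # 0.65
```

The point is not the arithmetic but the bookkeeping: the prior is written down explicitly, so mixed or negative evidence shifts it in a transparent way rather than being absorbed into an internal, amorphous hunch.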

Friday, January 13, 2017

87 years ago, in ecology

Louis Emberger was an important French plant ecologist in the first half of the last century, known for his work on the assemblages of plants in the Mediterranean.

For example, the plot below is his published diagram showing the minimum temperature of the coolest month versus a 'pluviometric quotient' capturing several aspects of temperature and precipitation, from:

Emberger, L. (1930) La végétation de la région méditerranéenne. Rev. Gén. Bot., 42.
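For reference, a commonly used modern form of Emberger's quotient is Q = 2000·P / (M² − m²), where P is mean annual precipitation (mm), and M and m are the mean maximum of the hottest month and mean minimum of the coldest month, in kelvin. The sketch below uses that form with made-up inputs, not values from the 1930 paper, and is not necessarily Emberger's exact original formulation:

```python
def emberger_quotient(annual_precip_mm, t_max_hot_c, t_min_cold_c):
    """Emberger's pluviothermic quotient, Q = 2000*P / (M^2 - m^2).

    P: mean annual precipitation (mm); M, m: mean maximum of the hottest month
    and mean minimum of the coldest month, converted to kelvin. This is the
    commonly cited modern form, not necessarily the exact 1930 formulation.
    """
    M = t_max_hot_c + 273.15
    m = t_min_cold_c + 273.15
    return 2000 * annual_precip_mm / (M**2 - m**2)

# Made-up example values for a Mediterranean-type climate station:
print(f"Q = {emberger_quotient(450, 31.0, 5.0):.0f}")  # ~59
```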

Note this wasn't an unappreciated or ignored paper - it has received a couple hundred citations, continuing up to the present day. Further, updated versions of the diagram have appeared in more recent years (see bottom).

So it's fascinating to see the eraser marks and crossed-out lines, this visualisation of scientific uncertainty. The final message probably depends on your perspective and personality:
  • Does it show that plant-environment modelling has changed a lot, or that it is still asking about the same underlying processes in similar ways?
  • Does this highlight the value of expert knowledge (still cited) or the limitations of expert knowledge (eraser marks)? 
It's certainly a reminder of how lucky we are to have modern graphical software :)



E.g. updated in Hobbs, Richard J., D. M. Richardson, and G. W. Davis. "Mediterranean-type ecosystems: opportunities and constraints for studying the function of biodiversity." Mediterranean-Type Ecosystems. Springer Berlin Heidelberg, 1995. 1-42.

Thanks to Eric Garnier, for finding and sharing the original Emberger diagram and the more recent versions.