Thursday, April 24, 2014

Data merging: are we moving forward or dealing with Frankenstein's monster?


I’m sitting in the Sydney airport waiting for my delayed flight, which gives me some time to ruminate about the mini-conference I am leaving. The conference, hosted by the Centre for Biodiversity Analysis (CBA) and CSIRO in Australia, on "Understanding biodiversity dynamics using diverse data sources", brought together several fascinating thinkers working in disparate areas including ecology, macroecology, evolution, genomics, and computer science. The goal of the conference was to see if merging different forms of data could lead to greater insights into biodiversity patterns and processes.

Happy integration

On the surface, it seems uncontroversial to say that bringing together different forms of data promotes new insights into nature. However, this only works if the data we combine meaningfully complement one another. When researchers bring together data, there are under-appreciated risks, and the effort can end up combining data that make weird bedfellows.
Weird bedfellows

The risks include data that are mismatched in the scale of observation, so that meaningful variation is missed. Data are often generated according to certain models with specific assumptions, and these data-generation steps can be misunderstood by end-users, resulting in inappropriate uses of data. Further, different data may be combined in standard statistical models even though the linkages between data types are much more subtle and nuanced, and require alternative models.

These issues arise because researchers now have unprecedented access to numerous large data sets. Whether these are large trait data sets, spatial locations, spatial environmental data, genomes, or historical data, they are all built with specific underlying uses, limitations and assumptions.

Despite these concerns, multiple types of data greatly enhance the opportunity and power to address new questions. One thing I gained from this meeting is that a new world of biodiversity analysis and understanding is emerging from smart people doing smart things with multiple types of data. We will soon live in a world where the data and analytical tools allow researchers to truly combine multiple processes to predict species' distributions, or to move from evolutionary events in deep history to modern-day ecological patterns.


2 comments:

dwbapst said...

Is paleo / fossil record data included? Just curious.

Marc Cadotte said...

Definitely. It seems that people often talk about incorporating paleo data, but I wonder if we really understand the complexity and assumptions of these data.