P-values are an integral part of most scientific analyses, papers, and journals, and yet they come with a hefty list of concerns and criticisms from frequentists and Bayesians alike. An editorial in Nature (by Regina Nuzzo) last week provides a good reminder of some of the more concerning issues with the p-value. In particular, she explores how the obsession with "significance" undermines reproducibility and produces results that are significant but biologically meaningless.
Ronald Fisher, who popularized the p-value, never intended it to be used as a definitive test of “importance” (however you interpret that word). Instead, it was an informal barometer of whether a test hypothesis was worthy of continued interest and testing. Today though, p-values are often used as the final word on whether a relationship is meaningful or important, on whether the test or experimental hypothesis has any merit, even on whether the data is publishable. For example, in ecology, significance values from a regression or species distribution model are often presented as the results.
This small but troubling shift away from the original purpose of p-values is tied to concerns about false alarms and about the replicability of results. One recent suggestion for increasing replicability is to make the significance threshold more stringent: to require that p-values be less than 0.005. But the point the author makes is that although p-values are typically interpreted as “the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true”, this doesn't mean that a p-value of 0.01 in one study carries the same weight as a p-value of 0.01 in another study. P-values are not comparable across studies because the likelihood that there was a real (experimental) effect to start with alters the likelihood that a low p-value is just a false alarm (figure). The more unlikely the test hypothesis, the more likely a p-value of 0.05 is a false alarm. Data mining in particular will be (unwittingly) sensitive to this kind of problem, since it tests many hypotheses that each have low prior odds of being true. Of course, one is unlikely to know what the odds of the test hypothesis are, especially a priori, which makes it even more difficult to think about and use p-values correctly.
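To make the dependence on prior odds concrete, here is a minimal sketch in Python. It is not the calculation behind the editorial's figure; it is a simpler textbook version that asks what share of results with p below a threshold are false alarms. The function name `false_alarm_probability`, the 80% power, and the example prior probabilities are all assumptions chosen for illustration.

```python
def false_alarm_probability(prior_real, alpha=0.05, power=0.8):
    """Share of 'significant' results (p < alpha) that are false alarms.

    prior_real: assumed probability that the effect is real before the study.
    alpha:      significance threshold.
    power:      assumed probability of detecting a real effect (hypothetical value).
    """
    if not 0.0 <= prior_real <= 1.0:
        raise ValueError("prior_real must be between 0 and 1")
    false_positives = alpha * (1.0 - prior_real)  # null true, test still significant
    true_positives = power * prior_real           # effect real and detected
    return false_positives / (false_positives + true_positives)


# Same threshold, very different meaning depending on how plausible
# the hypothesis was to begin with.
for prior in (0.5, 0.25, 0.05):
    share = false_alarm_probability(prior)
    print(f"prior P(real effect) = {prior:.2f} -> false-alarm share = {share:.2f}")
```

Under these assumptions, the false-alarm share climbs steeply as the hypothesis becomes a longer shot, even though the significance threshold never changes.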
The other oft-repeated criticism of p-values is that a highly significant p-value may still be associated with a tiny (and thus possibly meaningless) effect size. The obsession with p-values is particularly strange then, given that the question “how large is the effect?” should be more important than just answering “is it significant?”. Ignoring effect sizes leads to a trend of studies showing highly significant results with arguably meaningless effect sizes. This creates the odd situation that publishing well requires high profile, novel, and strong results, but one of the major tools for identifying these results is flawed. The editorial lists a few suggestions for moving away from the p-value: having journals require that effect sizes and confidence intervals be included in published papers; requiring statements to the effect of “We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study”, in order to limit data-mining; or, of course, moving to a Bayesian framework, where p-values are near heresy. The best advice though, is quoted from statistician Steven Goodman: “The numbers are where the scientific discussion should start, not end.”
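A quick way to see how significance and effect size come apart is to hold a tiny effect fixed and vary only the sample size. The sketch below is an illustration rather than anything from the editorial: it assumes a two-sample z-test on a standardized mean difference with equal group sizes, and the function name `two_sample_z` and the example numbers are hypothetical.

```python
import math


def two_sample_z(d, n_per_group, z_crit=1.96):
    """Two-sided p-value and approximate 95% CI for a standardized difference d.

    Assumes two equal-sized groups with unit standard deviation, so the
    standard error of the difference in means is sqrt(2 / n_per_group).
    """
    se = math.sqrt(2.0 / n_per_group)
    z = d / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    ci = (d - z_crit * se, d + z_crit * se)
    return p_value, ci


# The same tiny effect (0.02 standard deviations) at two sample sizes.
for n in (100, 100_000):
    p, (lo, hi) = two_sample_z(0.02, n)
    print(f"n per group = {n:>6}: p = {p:.2g}, 95% CI for d = ({lo:.3f}, {hi:.3f})")
```

With the larger sample the p-value becomes very small, yet the confidence interval shows that the effect itself is still a small fraction of a standard deviation, which is the kind of information the p-value alone hides.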