Making Sense of Replicability

A number of developments suggest a crisis in science associated with the requirement of replicability. The issues are complex, especially where scientific results include statistical estimates.

Originally posted 13/1/2014. Re-posted following site reorganisation 21/6/2016.

A widely accepted requirement of good scientific research is that results should be replicable. However, there is concern that much published research may fail to meet this requirement, and that aspects of the organisation of science – including criteria for academic appointments and publication of papers – do not provide appropriate incentives for testing replicability and facilitating the correction of errors. Once perhaps confined to a small minority, these concerns have in recent years become a topic of mainstream interest in several areas of science. In medicine, for example, the journal Infection and Immunity published in 2010 an editorial entitled Reproducible Science. Significantly, the editors (Casadevall & Fang) acknowledged not only that the assumed reproducibility of published science is rarely tested, but also that their own journal was unlikely to accept papers that replicated previously published findings (1). In psychology, a group of 16 authors from 6 countries (Asendorpf et al) published in 2012 a paper entitled Recommendations for Increasing Replicability in Psychology (2). Where replication has been attempted, the findings have often cast doubt on the originally published results (3).

I hope in a future post to comment on the relevance of this debate to environmental valuation. Here I make some general observations relevant to many sciences.

Lack of Generally Accepted Definitions

An initial difficulty in making sense of this debate is that there seem to be no generally accepted definitions of key terms. Where academic journals have policies on disclosure of methods or data, they may refer to either replicability or reproducibility, terms which may appear interchangeable. Writing in a machine learning context, however, Drummond identifies reproducibility as the critical scientific requirement, taking it to refer to reproducibility of results (4). He makes the important point that to obtain the same result from two experiments is a much more powerful finding where the experiments are quite different than where they are identical. Casadevall & Fang make essentially the same point in stating that a finding which is highly dependent on precise experimental conditions may be of limited interest (5). The implication is that scientists should not try to replicate every detail of an experiment, but should seek rather to replicate the essential and vary the inessential. What is essential will depend upon the nature and scope of the result the experiment has been held to support (the more general the claimed result, the less detail will be essential). Thus being able to obtain the same result from an exact repetition of an experiment is of limited value, and it is for this that Drummond reserves the term replicability (6).

For Asendorpf et al, however, replicability is being able to obtain similar results from different random samples drawn from a multi-dimensional space that represents the key aspects of the research design (7). This abstract formulation has the merit of embracing survey-based as well as experimental research. Leaving that aside, it seems quite close to reproducibility in Drummond’s sense. But reproducibility in the terminology of Asendorpf et al is simply data reproducibility, that is, being able, given a researcher’s data and analytical methods, to reproduce the original analysis and obtain the same results (8). We are left with the confusing conclusion that the terms replicability and reproducibility are distinguished by different writers in quite different ways.

Three Requirements of Scientific Studies

Let us take from this discussion the following requirements, not attempting to label them other than by number:

1. Being able to obtain results similar to those obtained from an original study, given its data, by using a similar method of data analysis.
2. Being able to obtain data and results similar to those obtained from an original study by undertaking similar experiments or surveys and using a similar method of data analysis.
3. Being able to obtain conclusions similar to those obtained from an original study by undertaking different experiments or surveys (and perhaps as a consequence a different method of data analysis).

We can assert, broadly, that 1-3 are all important, since a failure to meet any one of them would cast doubt on the original study, but that for establishing interesting new results it is 3 that is crucial.

However, the interpretation of 1-3 raises some further questions. What exactly do we mean by being able? In 1, the main conditions for being able to obtain similar results are the availability of the data and the method of analysis, and an absence of error in the original analysis. The focus, therefore, is on the original researchers: have they published or otherwise made available their data and methods, and did they apply their method correctly? Journal publishers and editors have a key role in ensuring that these conditions are met. In 3, the main conditions for being able to obtain similar conclusions are the validity of the original conclusions, and the availability of alternative means of testing those conclusions. The focus, therefore, is on how the world is: whether it is such that the result is approximately true, and whether it offers scope for alternative means of testing. Again, the fact that 3 relates to the world, not to the original researchers, highlights its crucial importance. In this respect 2 is somewhere between 1 and 3: the focus is partly on whether the original results were obtained with due care, but also partly on whether a difference in some unknown causal factor might produce different results in apparently similar circumstances.

Similarity of Results

The term similar also needs interpretation. Here I will focus on similarity of results. A key question is how to interpret similarity where, as is often the case, results take the form of statistical estimates. An obvious answer is that similar means no more different than can reasonably be attributed to sampling and measurement error, but this requires further interpretation. Suppose the research question is whether variable X has a material positive effect on variable Y, with the threshold for materiality being taken to be a regression coefficient B exceeding 10.0. An initial study estimates B at say 11.6, with a standard error of 0.8. A second study estimates B at say 10.4 with a standard error of 0.6. Given these results we can apply the following hypothesis tests (9):

Test 1: Null hypothesis: B does not exceed 10.0. Tested using results from first study. Conclusion: Reject null hypothesis at 5% significance level (p = 0.02).

Test 2: Null hypothesis: B does not exceed 10.0. Tested using results from second study. Conclusion: Do not reject null hypothesis at 5% significance level (p = 0.25).

Test 3: Null hypothesis: the samples in the two studies were drawn randomly from the same population, and therefore if the methods of both studies were repeated many times the mean estimate of B obtained by the method of the first study would not exceed the mean estimate of B obtained by the method of the second study. Tested using results from both studies. Conclusion: Do not reject null hypothesis at 5% significance level (p = 0.12).

Thus the two studies yield different conclusions at the 5% significance level regarding the coefficient B, but their results are not so different that it would be implausible to attribute the difference to sampling error. Such a situation is not well described by saying that the results of the two studies are dissimilar, nor by saying that they are similar. Where results are of a statistical nature, there is no sharp distinction between similarity and dissimilarity. A more appropriate statement would be that the results of the second study differ from those of the first, but not by more than can reasonably be attributed to random variation between samples. Such a situation is unlikely to be resolved without further studies.
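The arithmetic behind the three tests can be checked with a short script. What follows is only a minimal sketch: it assumes, as the calculations in note 9 do, that the samples are large enough for the test statistics to be treated as approximately normal, and the function names (normal_cdf, one_sided_p, difference_p) are my own labels for illustration rather than anything taken from the studies discussed.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p(estimate: float, threshold: float, std_error: float):
    """Tests 1 and 2. Null hypothesis: B does not exceed the threshold."""
    z = (estimate - threshold) / std_error
    return z, 1.0 - normal_cdf(z)


def difference_p(est_a: float, se_a: float, est_b: float, se_b: float):
    """Test 3. Null hypothesis: the first estimate does not exceed the second.

    The standard error of the difference between two independent
    estimates is the square root of the sum of their squared standard errors.
    """
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return z, 1.0 - normal_cdf(z)


THRESHOLD = 10.0  # materiality threshold for the coefficient B

z1, p1 = one_sided_p(11.6, THRESHOLD, 0.8)  # first study
z2, p2 = one_sided_p(10.4, THRESHOLD, 0.6)  # second study
z3, p3 = difference_p(11.6, 0.8, 10.4, 0.6)  # both studies

for label, z, p in [("Test 1", z1, p1), ("Test 2", z2, p2), ("Test 3", z3, p3)]:
    verdict = "reject" if p < 0.05 else "do not reject"
    print(f"{label}: z = {z:.2f}, p = {p:.2f} ({verdict} null at 5%)")
```

Run as written, this should give p-values of about 0.02, 0.25 and 0.12, matching the three tests. All three tests are one-sided, which is why each p-value is simply one minus the cumulative probability.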

Notes and References

1. Casadevall A & Fang F C (2010) Editorial: Reproducible Science Infection and Immunity 78(12) pp 4972-5 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2981311/

2. Asendorpf J B, Conner M & 14 others (2013) Recommendations for increasing replicability in psychology European Journal of Personality 27(2) pp 108-119 http://onlinelibrary.wiley.com/doi/10.1002/per.1919/abstract

3. Some examples are described in:
Zimmer C (25/6/2011) It’s Science, But Not Necessarily Right The New York Times http://www.nytimes.com/2011/06/26/opinion/sunday/26ideas.html?emc=eta1&_r=1&
Trouble at the lab, The Economist (19/10/2013) pp 23-27

4. Drummond C (2009) Replicability is not Reproducibility: Nor is it Good Science Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, Montreal, Canada, 2009 p 2 http://www.site.uottawa.ca/ICML09WS/papers/w2.pdf

5. Casadevall A & Fang F C, as 1 above, p 4973

6. Drummond, as 4 above, p 2

7. Asendorpf et al, as 2 above, p 109

8. Asendorpf et al, as 2 above, p 109

9. The test calculations assume that the samples are large enough that the distributions of the test statistics approximate to the normal distribution. All three tests are one-sided, so in each case p is 1 minus the cumulative probability of the standard normal distribution at z. For test 1, z = (11.6 – 10.0) / 0.8 = 2.0, hence (from tables) cumulative probability = 0.98, so p = 0.02. For test 2, z = (10.4 – 10.0) / 0.6 = 0.67, hence cumulative probability = 0.75, so p = 0.25. Test 3 is a comparison of means with unequal variance, the test statistic being z = (11.6 – 10.4) / Sqrt(0.8² + 0.6²) = 1.2 / 1.0 = 1.2, hence cumulative probability = 0.88, so p = 0.12.