Making Sense of Replicability

A number of developments suggest a crisis in science associated with the requirement of replicability.  The issues are complex, especially where scientific results include statistical estimates.

Originally posted 13/1/2014.  Re-posted following site reorganisation 21/6/2016.

A widely accepted requirement of good scientific research is that results should be replicable.  However, there is concern that much published research may fail to meet this requirement, and that aspects of the organisation of science – including criteria for academic appointments and publication of papers – do not provide appropriate incentives for testing replicability and facilitating the correction of errors.  Once perhaps confined to a small minority, these concerns have in recent years become a topic of mainstream interest in several areas of science.  In medicine, for example, the journal Infection and Immunity published in 2010 an editorial entitled Reproducible Science.  Significantly, the editors (Casadevall & Fang) acknowledged not only that the assumed reproducibility of published science is rarely tested, but also that their own journal was unlikely to accept papers that replicated previously published findings (1).  In psychology, a group of 16 authors from 6 countries (Asendorpf et al) published in 2013 a paper entitled Recommendations for Increasing Replicability in Psychology (2).  Where replication has been attempted, the findings have often cast doubt on the originally published results (3).

I hope in a future post to comment on the relevance of this debate to environmental valuation.  Here I make some general observations relevant to many sciences.

Lack of Generally Accepted Definitions

An initial difficulty in making sense of this debate is that there seem to be no generally accepted definitions of key terms.  Where academic journals have policies on disclosure of methods or data, they may refer to either replicability or reproducibility, terms which may appear interchangeable.  Writing in a machine learning context, however, Drummond identifies reproducibility as the critical scientific requirement, taking it to refer to reproducibility of results (4).  He makes the important point that to obtain the same result from two experiments is a much more powerful finding where the experiments are quite different than where they are identical.  Casadevall & Fang make essentially the same point in stating that a finding which is highly dependent on precise experimental conditions may be of limited interest (5).  The implication is that scientists should not try to replicate every detail of an experiment, but should seek rather to replicate the essential and vary the inessential.  What is essential will depend upon the nature and scope of the result the experiment has been held to support (the more general the claimed result, the less detail will be essential).  Thus being able to obtain the same result from an exact repetition of an experiment is of limited value, and it is for this that Drummond reserves the term replicability (6).

For Asendorpf et al, however, replicability is being able to obtain similar results from different random samples drawn from a multi-dimensional space that represents the key aspects of the research design (7).  This abstract formulation has the merit of embracing survey-based as well as experimental research. Leaving that aside, it seems quite close to reproducibility in Drummond’s sense.  But reproducibility in the terminology of Asendorpf et al is simply data reproducibility, that is, being able, given a researcher’s data and analytical methods, to reproduce the original analysis and obtain the same results (8).  We are left with the confusing conclusion that the terms replicability and reproducibility are distinguished by different writers in quite different ways.

Three Requirements of Scientific Studies

Let us take from this discussion the following requirements, not attempting to label them other than by number:

  1. Being able to obtain results similar to those obtained from an original study, given its data, by using a similar method of data analysis.
  2. Being able to obtain data and results similar to those obtained from an original study by undertaking similar experiments or surveys and using a similar method of data analysis.
  3. Being able to obtain conclusions similar to those obtained from an original study by undertaking different experiments or surveys (and perhaps as a consequence a different method of data analysis).

We can assert, broadly, that 1-3 are all important, since a failure of any of these kinds would cast doubt on the original study, but that it is 3 that is crucial for establishing interesting new results.

However, the interpretation of 1-3 raises some further questions.  What exactly do we mean by being able?  In 1, the main conditions for being able to obtain similar results are the availability of the data and the method of analysis, and an absence of error in the original analysis. The focus, therefore, is on the original researchers: have they published or otherwise made available their data and methods, and did they apply their method correctly?  Journal publishers and editors have a key role in ensuring that these conditions are met.  In 3, the main conditions for being able to obtain similar conclusions are the validity of the original conclusions, and the availability of alternative means of testing those conclusions.  The focus, therefore, is on how the world is: whether it is such that the result is approximately true, and whether it offers scope for alternative means of testing.  Again, the fact that 3 relates to the world, not to the original researchers, highlights its crucial importance.  In this respect 2 is somewhere between 1 and 3: the focus is partly on whether the original results were obtained with due care, but also partly on whether a difference in some unknown causal factor might produce different results in apparently similar circumstances.

Similarity of Results

The term similar also needs interpretation.  Here I will focus on similarity of results.  A key question is how to interpret similarity where, as is often the case, results take the form of statistical estimates.  An obvious answer is that similar means no more different than can reasonably be attributed to sampling and measurement error, but this requires further interpretation.  Suppose the research question is whether variable X has a material positive effect on variable Y, with the threshold for materiality being taken to be a regression coefficient B exceeding 10.0.  An initial study estimates B at, say, 11.6, with a standard error of 0.8.  A second study estimates B at, say, 10.4 with a standard error of 0.6.  Given these results we can apply the following hypothesis tests (9):

Test 1: Null hypothesis: B does not exceed 10.0.  Tested using results from first study.  Conclusion: Reject null hypothesis at 5% significance level (p = 0.02).

Test 2:  Null hypothesis: B does not exceed 10.0.  Tested using results from second study.  Conclusion: Do not reject null hypothesis at 5% significance level (p = 0.25).

Test 3:  Null hypothesis: the samples in the two studies were drawn randomly from the same population, and therefore if the methods of both studies were repeated many times the mean estimate of B obtained by the method of the first study would not exceed the mean estimate of B obtained by the method of the second study.  Tested using results from both studies.  Conclusion: Do not reject null hypothesis at 5% significance level (p = 0.12).

Thus the two studies yield different conclusions at the 5% significance level regarding the coefficient B, but their results are not so different that it would be implausible to attribute the difference to sampling error.  Such a situation is not well described by saying that the results of the two studies are dissimilar, nor by saying that they are similar. Where results are of a statistical nature, there is no sharp distinction between similarity and dissimilarity.  A more appropriate statement would be that the results of the second study differ from those of the first, but not by more than can reasonably be attributed to random variation between samples.  Such a situation is unlikely to be resolved without further studies.
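The three p-values can be reproduced with a short calculation, using the large-sample normal approximation set out in note 9.  This is purely illustrative code; the function name is my own and the standard normal CDF is computed from the error function:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Test 1: H0 that B does not exceed 10.0, using the first study (11.6, s.e. 0.8).
p1 = 1.0 - norm_cdf((11.6 - 10.0) / 0.8)   # z = 2.0
# Test 2: H0 that B does not exceed 10.0, using the second study (10.4, s.e. 0.6).
p2 = 1.0 - norm_cdf((10.4 - 10.0) / 0.6)   # z = 0.67
# Test 3: H0 that both estimates come from the same population; the standard
# error of the difference is sqrt(0.8^2 + 0.6^2) = 1.0.
p3 = 1.0 - norm_cdf((11.6 - 10.4) / sqrt(0.8**2 + 0.6**2))  # z = 1.2

print(round(p1, 2), round(p2, 2), round(p3, 2))  # 0.02 0.25 0.12
```

The third calculation shows why the two studies cannot be declared dissimilar: a z statistic of 1.2 is well within the range expected from sampling variation alone.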

Notes and References

1. Casadevall A & Fang F C (2010)  Editorial: Reproducible Science  Infection and Immunity 78(12) pp 4972-5  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2981311/

2. Asendorpf J B, Conner M & 14 others (2013) Recommendations for increasing replicability in psychology  European Journal of Personality 27(2) pp 108-119 http://onlinelibrary.wiley.com/doi/10.1002/per.1919/abstract

3.  Some examples are described in:
Zimmer C (25/6/2011)  It’s Science, But Not Necessarily Right  The New York Times  http://www.nytimes.com/2011/06/26/opinion/sunday/26ideas.html?emc=eta1&_r=1&
Trouble at the lab  The Economist (19/10/2013) pp 23-27

4. Drummond C (2009)  Replicability is not Reproducibility: Nor is it Good Science  Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, Montreal, Canada, 2009  p 2  http://www.site.uottawa.ca/ICML09WS/papers/w2.pdf

5.  Casadevall A & Fang F C, as 1 above, p 4973

6.  Drummond, as 4 above, p 2

7.  Asendorpf et al, as 2 above, p 109

8.  Asendorpf et al, as 2 above, p 109

9. The test calculations assume that the samples are large enough that the distributions of the test statistics approximate to the normal distribution. For test 1, z = (11.6 – 10.0) / 0.8 = 2.0, hence (from tables) cumulative probability of the standard normal distribution = 0.98, so p = 0.02.  For test 2, z = (10.4 – 10.0) / 0.6 = 0.67, hence cumulative probability = 0.75, so p = 0.25.  Test 3 is a comparison of means with unequal variance, the test statistic being (11.6 – 10.4) / (Sqrt(Sum of squares of 0.8 & 0.6)) = 1.2, hence cumulative probability = 0.88, so p = 0.12.


The Value of Time in the Travel Cost Method

Travel costs used in valuing recreational sites often include the value of travel time at around 1/3 of the wage rate.  A recent study suggests the fraction should be higher, for some travellers at least.

Originally posted 3/12/2013.  Re-posted following site reorganisation 21/6/2016.

Suppose researchers seek to estimate the use value of a non-market recreational site using the travel cost method.  They select a model for a trip-generating function in which travel cost is one of the explanatory variables.  Usually they will take travel cost to be the sum of the monetary costs incurred (eg the cost of petrol) and the value of the time taken.  They must then decide what monetary value to place on travel time, and this decision will probably have a large impact on their site valuation.  A study of a coastal wetland site in Louisiana, USA, for example, estimated an annual consumer surplus of $3.9  million if time were valued at the average wage, but only $1.3 million if time were valued at 10% of the wage (1).

It is possible, admittedly, to avoid the time value problem by using models in which recreationists maximise utility subject to separate money and time constraints (2).  Such models relate naturally to the economic theory of household behaviour, and can take account of differences in labour market circumstances, such as whether individuals work fixed hours for a fixed income or have flexibility to vary their hours and income.  However, this approach is more complex, requires collection of additional data, and is unlikely to be practicable when data limitations or other considerations lead researchers to adopt the aggregate (zonal) form of the travel cost method.

An old but still influential paper by Cesario drew on US and UK studies by transport planners which estimated the value of time by studying the behaviour of travellers when alternative modes of travel were available.  Suppose for example that mode A at 60 mph costs £0.2 per mile while mode B at 40 mph costs £0.1 per mile.  Then a journey of 240 miles by mode A takes 4 hours at a cost of £48, and by mode B takes 6 hours at a cost of £24.  If a traveller opts for mode A, saving 2 hours’ time but costing £24 more, this suggests that his value of time is at least £12 per hour.  Cesario concluded that the average value of non-work travel time was between 25% and 50% of the wage rate, but that even for one person the value of time may vary with the purpose and length of the journey and other factors (3).
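The revealed-preference arithmetic behind this mode-choice example can be set out as a short calculation (a sketch only, using the hypothetical figures above; the function name is mine):

```python
def implied_value_of_time(miles, fast_mph, fast_cost_per_mile, slow_mph, slow_cost_per_mile):
    """Minimum hourly value of time implied by choosing the faster, dearer mode."""
    hours_saved = miles / slow_mph - miles / fast_mph   # 6 h - 4 h = 2 h
    extra_cost = miles * (fast_cost_per_mile - slow_cost_per_mile)  # £24
    return extra_cost / hours_saved

# Mode A: 60 mph at £0.2/mile; mode B: 40 mph at £0.1/mile; journey of 240 miles.
v = implied_value_of_time(240, 60, 0.2, 40, 0.1)
print(round(v, 2))  # 12.0
```

The choice of mode A reveals only a lower bound: the traveller values time at £12 per hour or more, but the observation alone cannot say how much more.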

A recent study by Fezzi, Bateman and Ferrini suggests a much higher range for the value of individuals’ time: from below 50% to above 100% of their wage rate (4).  It focused on travellers to beaches on the Italian Riviera whose journeys offered a choice between toll highways and open-access roads, the former being faster but subject to toll fees.  Thus an advantage of the study was that it related specifically to travel to recreational sites, rather than to non-work travel generally.  It also used Monte Carlo analysis to evaluate alternative assumptions about the value of time for use when individual data were not available.  The best assumption was identified as 75% of the average wage rate, while an assumption of 1/3 of the wage rate (as commonly used following Cesario) was shown to result in downwardly biased site valuations (5).

The study by Fezzi et al is in many ways well designed.  However, it is unclear to what extent its conclusions are generalisable to different sites and different groups of visitors.  One point is that the study, by its very design, focuses on car-users.  Perhaps those who use a car tend to place a higher value on their time than those who do not, even if the effect of income is controlled for.  Another is that the circumstances in which the travellers chose between toll highways and open-access roads may not be representative of those faced by the generality of car-using recreationists.  The journey times faced by travellers if opting for the fastest non-toll routes showed a mean of about 4 hours with a maximum of over 12 hours (6).  Of the sampled visitors, 34% were making day trips and 66% staying for longer holidays (7).  It appears possible therefore that a significant proportion of the sampled visitors were in one or other of the following circumstances in which their value of time might be expected to be above its average level:

  1. A group wishes to visit a site on a day trip, to avoid the expense of an overnight stay at the site or impingement on their work or other plans for the following day.  However, the distance they have to travel is such that it is difficult, using open-access roads, to complete their trip within one day, leaving and returning at acceptable times and allowing sufficient on-site time for the trip to be worthwhile.
  2. A group has arranged a longer holiday at a site, but wishes to complete their journey to the site in a single day to avoid the expense of an overnight stay on the way, and similarly for their return journey.  However, the distance is such that it is difficult, using open-access roads, to complete the one-way journey in a single day, starting and finishing at acceptable times.

Under either of these scenarios there might be other factors adding to the pressure to choose a faster route: allowing a driver time for breaks within a long journey; a preference for not driving on rural roads in the dark; minimising the boredom of children in the back seat; and so on.  Suppose a group’s normal value of leisure time, revealed by the choices they make on trips when the length of the day is not a constraint, is such that they would normally avoid toll highways when an open-access road is available.  Nevertheless, in the circumstances described they will probably choose the toll highway.  If that choice is then viewed in isolation, without regard to the circumstances prompting it, and a conclusion is drawn regarding the group’s normal value of time, it is likely that the result will be upwardly biased.

The study by Fezzi et al is a useful contribution to a debate that is far from resolved.  It provides some justification for valuing time at more than the common 1/3 of the wage rate for recreational visits involving car journeys of several hours.  However, it would be wrong to infer from this study that travel time should be valued at 75% of the wage rate for all kinds of recreational sites and visitors.

Notes and References

1.  Farber S (1988)  The value of coastal wetlands for recreation: an application of travel cost and contingent valuation methodologies  Journal of Environmental Management  26 p 305

2.  See for example Larson D & Shaikh S (2001)  Empirical Specification Requirements for Two-Constraint Models of Recreation Choice  American Journal of Agricultural Economics 83(2)  pp 429-430

3.  Cesario F (1976)  Value of Time in Recreation Benefit Studies  Land Economics 52(1) p 37

4. Fezzi C, Bateman I & Ferrini S (Authors’ Accepted Manuscript 2013)  Using Revealed Preferences to Estimate the Value of Travel Time to Recreation Sites  Journal of Environmental Economics and Management  p 17   http://www.sciencedirect.com/science/article/pii/S0095069613000880

5.  Fezzi et al, as above pp 20-1

6.  Fezzi et al, as above p 26 Table 1

7.  Fezzi et al, as above p 9


The Travel Cost Method – Some Pitfalls

The travel cost method is an essential tool in estimating the economic value of non-market recreational sites such as parks,  but its application is rarely straightforward.  Here are some of its pitfalls.

Originally posted 11/8/2012.  Re-posted following site reorganisation 21/6/2016.

There are three stages in a travel cost site valuation (1):

  1. Obtain data on visit rates, visitors’ travel costs, and perhaps other variables.  The data may be either at the level of individuals, or aggregated by zones of origin.
  2. Use regression analysis to estimate how the visit rate depends on travel cost and any other variables.
  3. Derive a demand curve for visits to the site, and find the associated consumer surplus which is a measure of the site’s use value.
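The three stages can be illustrated by a minimal sketch of the zonal approach.  All data here are invented, and a linear trip-generating function is assumed purely for simplicity; real studies use richer specifications and more explanatory variables:

```python
# Hypothetical zonal data: (zone population, travel cost per trip, observed visits).
zones = [(10000, 5.0, 900), (20000, 10.0, 1200), (15000, 20.0, 300), (12000, 30.0, 0)]

# Stage 2: ordinary least squares fit of the visit rate (visits per capita)
# against travel cost, giving a linear trip-generating function rate = a + b*cost.
costs = [c for _, c, _ in zones]
rates = [v / pop for pop, _, v in zones]
n = len(zones)
mc, mr = sum(costs) / n, sum(rates) / n
b = sum((c - mc) * (r - mr) for c, r in zip(costs, rates)) / sum((c - mc) ** 2 for c in costs)
a = mr - b * mc  # b is negative: higher travel cost, lower visit rate

# Stage 3: with a linear demand curve, each zone's consumer surplus is the area
# of the triangle under the curve between the zone's travel cost and the choke
# price (the cost at which the predicted visit rate falls to zero).
choke = -a / b
surplus = sum(pop * (a + b * c) ** 2 / (-2 * b) for pop, c, _ in zones if c < choke)
print(round(choke, 1), round(surplus))
```

Even this toy version exhibits one of the judgments discussed below: the zero-visit zone contributes to the regression but, lying above the estimated choke price, adds nothing to the surplus.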

Pitfall 1 – Data Collection via an Off-site Survey     Contacting every nth person on the electoral list by mailshot or telephone might seem a simple means of ensuring an unbiased dataset.  However, response rates could be low, and a high proportion of people might not have visited the site at all.  Hence a very large sample, with high data collection costs, might be needed to obtain a useful dataset.  Bias might also be introduced by a higher response rate from those who have visited the site.  Probably for these reasons, most travel cost valuations are based on on-site surveys  (2).

Pitfall 2 – Simplistic Analysis of Individual On-site Survey Data     Any on-site survey, whether via short interviews, or handing-out of questionnaires to be completed and returned, over-represents people who visit the site relatively often and excludes those who do not visit it at all.  Hence direct application to individual on-site data of standard regression techniques such as ordinary or weighted least squares will lead to inflated estimates of visit rates, and therefore of the site value.  The on-site individual approach, although the most common in recent literature, requires specialised techniques to correct for these sample characteristics (3).
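The inflation can be illustrated by a small simulation.  If annual visit counts in the population follow a Poisson distribution with mean λ, then on-site interception (where a person making v visits is v times as likely to be sampled) yields a sample whose mean visit count tends to λ + 1, not λ.  The simple subtract-one correction below holds only for this Poisson case; the parameters are invented, and the estimators in (3) generalise the idea:

```python
import math, random
random.seed(0)

LAM = 2.0  # assumed mean annual visits per person in the population

def poisson(lam):
    """Draw from a Poisson distribution (Knuth's method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

population = [poisson(LAM) for _ in range(100000)]

# On-site interception: a person making v visits turns up v times, so the
# sample over-represents frequent visitors and excludes non-visitors entirely.
onsite = [v for v in population for _ in range(v)]

naive = sum(onsite) / len(onsite)   # tends to LAM + 1, not LAM
corrected = naive - 1.0             # size-biased Poisson correction
print(round(naive, 2), round(corrected, 2))
```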

Pitfall 3 – Zonal Bias in On-site Data     The older but still widely used zonal approach also begins with a survey of individuals, to provide raw data for aggregation at zonal level.  To minimise data collection costs, it might seem sensible to choose a sunny weekend when many visitors are on site.  However, this could result in disproportionate selection of visitors from more distant zones, leading to an inflated estimate of site value.  To avoid this type of bias, the sample may need to be stratified with respect to variables such as day of week, weather conditions, and location within a large, multi-entrance site (4).

Pitfall 4 – Inappropriate Treatment of Zero-Visit Zones     When using the zonal approach, it may be found that the on-site sample of visitors contains no visitors from certain zones. Such zero-visit zones can be awkward to handle in the regression analysis, and it can be tempting to drop them from the dataset, but this can lead to a biased valuation (see diagram below).  The basic principle here must be that any observed visit rate, positive or zero, is part of the data to be analysed (5).  It can however be appropriate sometimes to drop some zero-visit zones as a step in the data analysis, prior to the regression.  A dataplot may suggest an approximate choke price (travel cost above which the visit rate is zero).  It could then be appropriate to drop those zero-visit zones with travel costs above that price (see diagram).

Pitfall 5 – Underestimating the Frequency and Importance of Multi-Purpose Trips     Multi-purpose and multi-destination trips may take a variety of forms.  At one extreme is the person on a foreign holiday visiting a national park and other tourist sites.  However, visitors to urban parks are often engaged in short multi-purpose trips, perhaps involving meeting friends, shopping or going to a restaurant.  Multi-purpose and multi-destination trips complicate the determination of travel costs and, for the zonal approach, the assignment of individuals to zones of origin.  Every travel cost study needs a strategy for handling them at both the data collection and the analysis stages. There is a large literature on this problem (6), with no one-size-fits-all solution.

Notes and references

1.  A useful and accessible explanation of the travel cost method is Karasin L The Travel Cost Method: Background, Summary, Explanation and Discussion  http://www.ulb.ac.be/ceese/PAPERS/TCM/TCM.html

2.  A rare example of a travel cost valuation based on an off-site survey is described in Gum R L & Martin W E (1975) Problems and Solutions in Estimating the Demand for and Value of Rural Outdoor Recreation  American Journal of Agricultural Economics 57 pp 558-566.

3.  Englin E & Shonkwiler J S (1995)  Estimating Social Welfare Using Count Data Models: An Application to Long-Run Recreation Demand Under Conditions of Endogenous Stratification and Truncation  The Review of Economics and Statistics  77(1) pp 104-112.

4.  An on-site survey with stratified sampling is described in Rolfe J & Prayaga P (2007) Estimating Values for Recreational Fishing at Freshwater Dams in Queensland  The Australian Journal of Agricultural and Resource Economics 51 p 160.

5.  The principle is asserted in Christensen J B & Price C (1982) A Note on the Use of Travel Cost Models with Unequal Zonal Populations: Comment  Land Economics 58(3) p 399.  An example of a study explicitly stating that zero-visit zones are included in the analysis is Mendelsohn R, Hof J, Peterson G & Johnson R (1992)  Measuring Recreation Values with Multiple Destination Trips  American Journal of Agricultural Economics 74 p 931.

6.  See for example Loomis J (2006) A Comparison of the Effect of Multiple Destination Trips on Recreation Benefits as Estimated by Travel Cost and Contingent Valuation Methods  Journal of Leisure Research 38(1) pp 46-60.


Brexit and Fisheries

If the UK left the EU, it would take its sea fisheries with it.

It’s outside the scope of this blog to give a view on whether the UK should remain within or leave the European Union; most of the issues have little to do with the environment or natural resources.  However, one consequence of leaving would be that the sea within the UK’s  Exclusive Economic Zone (EEZ) extending up to 200 miles from its coast would no longer be subject to the EU’s Common Fisheries Policy.  That has implications for the UK fishing industry, for management of fish stocks within the UK’s EEZ, and for the UK’s enforcement of its EEZ rights.

Many aspects of the EU are about removing barriers: the Common Fisheries Policy is different in that it is about sharing a resource.  A Spanish farmer, for example, can sell his produce in the UK without any import duty being levied on it.  He can also move to the UK.  What he can’t do, unless he is prepared to pay the market rate for land, is start farming on UK land.  A Spanish fisherman, by contrast, has the same freedoms as the farmer but, in addition and without any payment, can sail his boat into UK waters and catch fish; a UK fisherman can do the same in Spanish waters.  Fish within the EEZs of EU member states around the coasts of Europe are managed as a common resource.

The main instrument of management is a Total Allowable Catch for each species and area, determined annually by the EU Council of Ministers with advice from the EU Commission.  Member states are then allocated shares (quotas) of the Total Allowable Catch, which they in turn allocate among fishermen.  Enforcement of quotas is the responsibility of member states and, although some inspections take place at sea, is for practical reasons mainly exercised where fish are landed.

It is widely accepted that the record of the Common Fisheries Policy in conserving fish stocks is poor.  If the UK were no longer subject to the Policy, is it likely that it would do better in conserving fish stocks within its EEZ for the long-term benefit of its fishing industry?  Here are some reasons why it might:

  1. The UK government would be able to determine its own annual catch limits. With no need to bargain with other countries over quotas, it is more likely that science-based advice on maximum sustainable yields would be respected.
  2. The UK would be free to consider following Iceland and New Zealand in adopting individual transferable fishing quotas, an approach that may help encourage conservation (1) but has not found favour with the EU.
  3. UK fishermen may be more inclined to full compliance with a fishery policy determined by the UK and which they can see is for their industry’s long-term benefit.

And here are some reasons why it might not:

  1. Some fish species such as mackerel migrate in and out of the UK’s EEZ, which may therefore be too small an area of sea for effective management of conservation.
  2. UK politicians, and those who lobby them, may be no less likely to take a short-term view than those in other countries.
  3. Effective enforcement to prevent foreign boats coming to fish within the UK’s EEZ might not be a priority for government expenditure, and might also be constrained by a reluctance to create diplomatic incidents.

The balance of advantage for the UK seems unclear.

Notes and References

  1. Arnason R (2012) Property Rights in Fisheries: How Much Can Individual Transferable Quotas Accomplish?  Review of Environmental Economics and Policy  6(2) pp 217-236  http://reep.oxfordjournals.org/content/6/2/217