How sensitive are regression estimates based on aggregated data, zonal travel cost datasets for example, to the particular form of aggregation? Here I present an answer for one simple case.
Originally posted 21/3/2016. Re-posted following site reorganisation 21/6/2016.
This is a brief introduction to my paper entitled:
The Maximum Difference Between Regression Coefficients Estimated from Different Levels of Aggregation of the Same Underlying Data: A Theorem and Discussion
It can be downloaded here: Maximum Difference Theorem Adam Bailey 12.3.2016.
In a previous post, I used a case study to show that the results of a zonal travel cost study can be sensitive to zone definition. In other words, aggregations of the same underlying data within different zonal configurations can yield different results. The case also showed that such differences can be quite large, as is illustrated in Charts 2 and 3 of that post.
This finding has a bearing on the application to zonal travel cost studies of the requirement that scientific research be replicable. Suppose two researchers undertake independent studies including separate surveys to collect data, analyse their respective data within different zonal configurations, and obtain different results. How large a difference would indicate a failure of replication and cast doubt on the results of one or other of the studies? Sampling error is an issue to be considered but not the only one. Also relevant is the difference due to the different zonal configurations. For sampling error, we can use well-established methods, such as standard errors of regression parameter estimates and hypothesis testing, to determine how large a difference in results can reasonably be attributed to that source. But if we ask how much difference can reasonably be attributed to different zonal configurations, there are, so far as I am aware, no established methods available.
This line of thought led me to consider whether there is any theoretical maximum to the differences in regression parameter estimates that can arise from different levels of aggregation of the same underlying data. Note that this is an abstract formulation of the problem. Hence a solution would be of relevance to attempted replications not just of zonal travel cost studies but of findings based on aggregated data in any field of research.
In its full generality, the problem appears intractable. Complications to be addressed would include multiple regression, alternative functional forms, and alternative estimation techniques. However, I obtained a solution for a simple case involving higher-level and lower-level datasets meeting the following conditions:
- A bivariate regression model with linear functional form.
- Estimation by ordinary least squares.
- Each value of the independent and dependent variables in the higher-level dataset is the unweighted aggregation of a pair of such values in the lower-level dataset.
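To make these conditions concrete, here is a minimal sketch in Python. It is not taken from the paper: the data are made up, the function names `ols_slope` and `aggregate_pairs` are hypothetical, and "unweighted aggregation" is assumed to mean the simple sum of each pair.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope of y on x in a bivariate linear model with an intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    return float(np.dot(dx, y - y.mean()) / np.dot(dx, dx))


def aggregate_pairs(values, pairs):
    """Unweighted aggregation of each pair (assumed here to be the sum)."""
    values = np.asarray(values, dtype=float)
    return np.array([values[i] + values[j] for i, j in pairs])


# Made-up lower-level data: six observations of (x, y).
x_low = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_low = np.array([10.0, 9.0, 7.0, 6.0, 3.0, 2.0])

# One possible pairing of the six observations into three aggregates.
pairs = [(0, 1), (2, 3), (4, 5)]

x_high = aggregate_pairs(x_low, pairs)
y_high = aggregate_pairs(y_low, pairs)

slope_low = ols_slope(x_low, y_low)
slope_high = ols_slope(x_high, y_high)

print(f"lower-level slope:  {slope_low:.4f}")
print(f"higher-level slope: {slope_high:.4f}")
print(f"absolute difference: {abs(slope_high - slope_low):.4f}")
```

If both variables were aggregated by taking the mean of each pair rather than the sum, the higher-level slope would be unchanged, since rescaling both variables by the same factor leaves the OLS slope as it was.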
Given these assumptions, the maximum difference between the estimated slope parameters based on the two datasets can be shown to be a function of:
- The variance of the independent variable in the lower-level dataset.
- The maximum t of the absolute differences between the two values of the independent variable in each aggregation pair in the lower-level dataset.
- The mean absolute value r of the residuals in the regression based on the lower-level dataset.
Specifically:
Maximum Difference
The proof, given in full in the paper, uses only basic regression theory and elementary algebra.
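The three quantities on which the maximum difference depends can be computed directly from the lower-level data and the chosen pairing. The following is a rough sketch only: the data are made up, the function name `theorem_inputs` is hypothetical, and the use of the population variance (`ddof=0`) is an assumption, since the variance convention is defined in the paper. The bound itself is not computed here.

```python
import numpy as np


def theorem_inputs(x, y, pairs, ddof=0):
    """Return (variance of x, t, r) for a lower-level dataset and a pairing.

    variance: variance of the independent variable (ddof=0 is an assumption).
    t: largest absolute difference in x within any aggregation pair.
    r: mean absolute residual of the lower-level OLS regression of y on x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fit y = intercept + slope * x by ordinary least squares.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    variance = float(x.var(ddof=ddof))
    t = float(max(abs(x[i] - x[j]) for i, j in pairs))
    r = float(np.abs(residuals).mean())
    return variance, t, r


# Made-up lower-level data and one possible pairing.
x_low = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_low = np.array([10.0, 9.0, 7.0, 6.0, 3.0, 2.0])
pairs = [(0, 1), (2, 3), (4, 5)]

variance, t, r = theorem_inputs(x_low, y_low, pairs)
print(f"variance of x: {variance:.4f}, t: {t:.4f}, r: {r:.4f}")
```

Note that t depends on the pairing, whereas the variance and r depend only on the lower-level data, so different zonal configurations of the same data change only t.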
The paper also presents a simple example of a zonal travel cost dataset, showing that the slope parameters estimated from datasets obtained by pairings of the original dataset are within the limit defined by the theorem. It concludes with a consideration of the application of the theorem in testing whether the results of one research study replicate the results of another, and with suggestions for further research.