**Differences in national rates of Covid-19 infection may be partly due to differences in household sizes.**

While many questions about the Covid-19 virus are currently unanswered, one point on which there has been wide agreement is that transmission is more likely indoors than outdoors (1,2). If therefore we are to explain differences in national rates of infection, an obvious place to look is differences in indoor environments. A plausible hypothesis is the following:

Rates of Covid-19 infection will be higher, other things being equal, in larger households, that is, households with more occupants.

The thinking behind this is simple. If one member of a household becomes infected then, unless there is effective self-isolation within that household, it is quite likely that the infection will be transmitted to other members. The larger the household, the more people that member can infect. The hypothesis does not imply that household size is the sole or main reason for differences in rates of infection, merely that it is one contributory factor.

*If correct*, the hypothesis suggests a possible link between housing policy and rates of Covid-19 infection. Countries (such as the UK) with restrictive planning policies that have limited the supply of land for building new homes will have fewer homes than they would otherwise have. This reduced supply of housing will lead to higher costs (whether for ownership or renting). As a consequence, fewer people will be able to afford their own home, and (other things being equal) average household sizes will be larger: young adults, for example, will tend to stay longer with their parents before setting up their own home. Larger households in turn will create more scope for transmission of infection.

But is the hypothesis correct? An ideal test would require large sample data on household size and numbers infected at individual household level. Here I present the results of a ‘quick and dirty’ test based on data currently available at national level.

At the present time, reliable data on total rates of infection since the start of the outbreak is not available. National totals of confirmed cases are incomplete because many cases have not been confirmed by testing, and international comparisons of those totals reflect differences in rates of testing as much as in rates of infection. I therefore used national rates of death from Covid-19 as an admittedly imperfect proxy for rates of infection. Even death rates are unlikely to be fully comparable between countries, since practice in recording the cause of death of patients with multiple conditions may vary. As a proxy for rates of infection, death rates also suffer from the limitation that they are influenced by differences in health systems between countries. Nevertheless, it seems reasonable to assume, at least for the developed countries of Western Europe, that official figures on deaths from Covid-19 are at least of the right order of magnitude.

Average household size was calculated from national statistics for population and numbers of households.

A regression was estimated for the model:

DP = C + (B x PH) + E

where: DP is death rate from Covid-19 per million population; C is the regression constant; B is the slope coefficient; PH is average population per household; and E is the error term. The regression was run on data for 14 Western European countries: Austria, Belgium, Denmark, France, Germany, Italy, Ireland, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom.

Estimation of the regression was by weighted least squares, with weighting by population (implying that the fitting of the regression line takes more account of data points for countries with larger populations). The justification for the weighting is that a local random factor affecting the death rate within a region with a population of a few million could have a large effect on the overall death rate of a country with a smaller population. Within a larger country, however, the effect of such a local factor would be less, and different random factors within different regions of the country would probably tend to offset each other. It is proper to record that the choice of weighted least squares, rather than ordinary (i.e. unweighted) least squares, makes a large difference to the result.
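As a rough illustration of what population weighting does, the following is a minimal sketch of weighted least squares for a single explanatory variable. The function name and the four data points are hypothetical, made up purely for illustration; they are not the data used in this article, and the regression itself may have been run with different software.

```python
def weighted_least_squares(x, y, w):
    """Fit y = c + b*x by minimising sum(w * (y - c - b*x)**2).

    Returns (c, b): the regression constant and the slope coefficient.
    """
    total_w = sum(w)
    # Weighted means of the explanatory and dependent variables.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted covariance and variance terms.
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b = s_xy / s_xx
    c = y_bar - b * x_bar
    return c, b


# Hypothetical values for illustration only (NOT the article's data).
ph = [2.0, 2.2, 2.3, 2.5]          # average population per household
dp = [100.0, 300.0, 350.0, 600.0]  # deaths per million population
pop = [5.0, 10.0, 60.0, 45.0]      # population in millions, used as weights

c_wls, b_wls = weighted_least_squares(ph, dp, pop)
# Equal weights reduce the same formula to ordinary least squares.
c_ols, b_ols = weighted_least_squares(ph, dp, [1.0] * len(ph))

print("weighted:  ", round(c_wls), round(b_wls))
print("unweighted:", round(c_ols), round(b_ols))
```

Running the function with equal weights gives the ordinary least squares fit, so the two printed lines show directly how much the choice of weighting moves the coefficients for a given data set.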

The estimated regression line was:

DP = -1,334 + (770 x PH)

The precise values of the estimated coefficients, which rather implausibly imply a nil death rate at an average household size of about 1.7, are not important. What does matter is that the estimated slope coefficient is positive, consistent with the hypothesis, and is sufficiently large that the null hypothesis that its true value is zero or less is rejected at the 5% significance level (3).
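The household size at which the fitted line implies a nil death rate follows from setting DP to zero and solving for PH. The short sketch below does that arithmetic with the reported coefficients; the function name and the sample household size of 2.0 are assumptions chosen only for illustration.

```python
# Coefficients as reported in the estimated regression line.
INTERCEPT = -1334.0  # regression constant C
SLOPE = 770.0        # slope coefficient B


def fitted_death_rate(ph):
    """Fitted deaths per million for an average household size ph."""
    return INTERCEPT + SLOPE * ph


# Setting DP = 0 gives PH = -C / B.
zero_rate_ph = -INTERCEPT / SLOPE
print(round(zero_rate_ph, 2))  # 1.73

# Fitted value at a hypothetical average household size of 2.0.
print(fitted_death_rate(2.0))  # 206.0
```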

I would describe this result as ‘interesting’. But no conclusion should be drawn from it beyond the fact that the hypothesis merits further research.

*A spreadsheet containing the underlying data and full regression output may be downloaded here:*

**Notes and References**

1. Sandhu, S (11/5/2020) Why you are less likely to catch coronavirus outside than indoors, according to experts *i* https://inews.co.uk/news/coronavirus-catch-outside-indoors-why-get-covid-19-explained-2848865
2. Moffitt, M (28/4/2020) China study suggests outdoor transmission of COVID-19 may be rare *SFGATE* https://www.sfgate.com/science/article/China-study-suggests-outdoor-transmission-of-15229649.php
3. This can be inferred from the fact that the 95% confidence limits of the estimated slope coefficient are both positive.