[Math Lair] Coronavirus and Inferential Statistics


With the Coronavirus pandemic in 2020, the question has been raised as to whether the virus is seasonal; in other words, does Coronavirus spread more slowly as the temperature increases? I've decided to investigate this question as an example of regression analysis. This should not be taken as an example of good methodology for selecting data to answer such a question (the choice of data has quite a number of methodological flaws); it's more a mathematical demonstration.

Here are the raw data. On March 3, there were 52 countries that had reported at least 2 cases of the virus. These are listed along with the average temperatures of their capital cities in March (except for China, where I used Wuhan's average temperature, and Italy, where I used Milan's). Under the assumption that the number of Coronavirus cases grows exponentially, we'll look at the natural logarithm of the number of cases (specifically, the difference in the logarithms) in order to be able to use linear regression.
Country | Average Temperature in March (°C) | Cases, March 3 | ln (Cases, March 3) | Cases, March 25 | ln (Cases, March 25) | ln (Cases, March 25) − ln (Cases, March 3)
China | 6.7 | 80304 | 11.2935747118947 | 81848 | 11.3126191475588 | 0.0190444356640604
South Korea | 5.7 | 4812 | 8.47886807709457 | 9137 | 9.12008738299862 | 0.641219305904052
Australia | 17.6 | 33 | 3.49650756146648 | 2252 | 7.71957398925958 | 4.2230664277931
Malaysia | 27.6 | 29 | 3.36729582998647 | 1624 | 7.39264752072162 | 4.02535169073515
Japan | 8.7 | 268 | 5.59098698051086 | 1193 | 7.08422642209792 | 1.49323944158706
Singapore | 28.0 | 108 | 4.68213122712422 | 558 | 6.32435896238131 | 1.64222773525709
Philippines | 28.7 | 3 | 1.09861228866811 | 552 | 6.31354804627709 | 5.21493575760898
Vietnam | 20.0 | 16 | 2.77258872223978 | 134 | 4.89783979995091 | 2.12525107771113
New Zealand | 15.8 | 2 | 0.693147180559945 | 189 | 5.24174701505964 | 4.5485998344997
Italy | 9.0 | 2036 | 7.61874237767041 | 69176 | 11.1444092606403 | 3.52566688296985
Spain | 11.2 | 114 | 4.7361984483945 | 39673 | 10.5884261345462 | 5.85222768615169
Germany | 5.1 | 157 | 5.05624580534831 | 31554 | 10.3594556428174 | 5.30320983746909
France | 8.8 | 191 | 5.25227342804663 | 22025 | 9.99993345080438 | 4.74766002275775
Switzerland | 5.3 | 30 | 3.40119738166216 | 8789 | 9.08125621856465 | 5.68005883690249
United Kingdom | 6.9 | 39 | 3.66356164612965 | 8081 | 8.99727090623345 | 5.3337092601038
Netherlands | 6.1 | 18 | 2.89037175789616 | 5560 | 8.62335338724463 | 5.73298162934846
Austria | 5.7 | 18 | 2.89037175789616 | 5282 | 8.57206009285708 | 5.68168833496091
Belgium | 6.8 | 8 | 2.07944154167984 | 4269 | 8.35913488675796 | 6.27969334507813
Norway | 3.3 | 25 | 3.2188758248682 | 2566 | 7.85010354517558 | 4.63122772030738
Sweden | 0.1 | 15 | 2.70805020110221 | 2272 | 7.72841577984104 | 5.02036557873883
Portugal | 14.9 | 2 | 0.693147180559945 | 2362 | 7.76726399675731 | 7.07411681619736
Denmark | 3.5 | 5 | 1.6094379124341 | 1591 | 7.37211802833779 | 5.76268011590369
Czechia (Czech Republic) | 3.6 | 3 | 1.09861228866811 | 1394 | 7.23993259132047 | 6.14132030265236
Israel | 15.0 | 10 | 2.30258509299405 | 1329 | 7.19218205871325 | 4.8895969657192
Finland | -1.3 | 7 | 1.94591014905531 | 792 | 6.67456139181443 | 4.72865124275911
Greece | 13.2 | 7 | 1.94591014905531 | 743 | 6.61069604471776 | 4.66478589566245
Iceland | 0.5 | 9 | 2.19722457733622 | 648 | 6.47389069635227 | 4.27666611901605
Russia | -1.0 | 3 | 1.09861228866811 | 658 | 6.48920493132532 | 5.39059264265721
Romania | 5.4 | 3 | 1.09861228866811 | 762 | 6.63594655568665 | 5.53733426701854
Croatia | 6.4 | 8 | 2.07944154167984 | 382 | 5.94542060860658 | 3.86597906692674
San Marino | 10.0 | 8 | 2.07944154167984 | 187 | 5.23110861685459 | 3.15166707517475
Azerbaijan | 7.0 | 3 | 1.09861228866811 | 87 | 4.46590811865458 | 3.36729582998647
Georgia | 6.6 | 3 | 1.09861228866811 | 73 | 4.29045944114839 | 3.19184715248028
Thailand | 29.5 | 43 | 3.76120011569356 | 934 | 6.83947643822884 | 3.07827632253528
Indonesia | 26.4 | 2 | 0.693147180559945 | 686 | 6.53087762772588 | 5.83773044716594
India | 22.1 | 5 | 1.6094379124341 | 562 | 6.33150184989369 | 4.72206393745959
Iran | 10.0 | 1501 | 7.31388683163346 | 24811 | 10.1190423822017 | 2.8051555505682
Pakistan | 17.0 | 5 | 1.6094379124341 | 991 | 6.89871453432999 | 5.28927662189589
Qatar | 17.0 | 7 | 1.94591014905531 | 526 | 6.26530121273771 | 4.3193910636824
Bahrain | 21.2 | 49 | 3.89182029811063 | 392 | 5.97126183979046 | 2.07944154167984
Egypt | 16.9 | 2 | 0.693147180559945 | 402 | 5.99645208861902 | 5.30330490805908
Lebanon | 16.0 | 13 | 2.56494935746154 | 304 | 5.71702770140622 | 3.15207834394469
Iraq | 16.6 | 26 | 3.25809653802148 | 316 | 5.75574221358691 | 2.49764567556543
Kuwait | 19.3 | 56 | 4.02535169073515 | 195 | 5.27299955856375 | 1.2476478678286
United Arab Emirates | 22.3 | 21 | 3.04452243772342 | 248 | 5.51342874616498 | 2.46890630844156
Oman | 25.0 | 6 | 1.79175946922805 | 99 | 4.59511985013459 | 2.80336038090653
United States | 8.3 | 64 | 4.15888308335967 | 51914 | 10.8573437822964 | 6.69846069893675
Canada | -2.2 | 27 | 3.29583686600433 | 1739 | 7.46106551435428 | 4.16522864834995
Brazil | 21.5 | 2 | 0.693147180559945 | 2201 | 7.69666708152646 | 7.00351990096652
Ecuador | 14.3 | 6 | 1.79175946922805 | 1049 | 6.9555926083963 | 5.16383313916824
Mexico | 18.1 | 5 | 1.6094379124341 | 370 | 5.91350300563827 | 4.30406509320417
Algeria | 13.2 | 5 | 1.6094379124341 | 264 | 5.57594910314632 | 3.96651119071222
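
As a quick check on the last column, here is a minimal Python sketch that recomputes the difference in logarithms for the first three countries. The case counts are copied from the table above; the names `rows` and `log_growth` are just illustrative choices, not part of the original analysis.

```python
import math

# (country, cases on March 3, cases on March 25), copied from the table above
rows = [
    ("China", 80304, 81848),
    ("South Korea", 4812, 9137),
    ("Australia", 33, 2252),
]

def log_growth(cases_start, cases_end):
    """Difference of natural logs, which equals ln(cases_end / cases_start)."""
    return math.log(cases_end) - math.log(cases_start)

# Map each country to its growth in log cases over the period
growth = {country: log_growth(c0, c1) for country, c0, c1 in rows}

for country, g in growth.items():
    print(f"{country}: {g:.4f}")
```

If cases really do grow exponentially, this quantity is proportional to the growth rate over the 22 days, which is why it can serve as the dependent variable in a linear model.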

Plotting these data, we get:

[Graph of average temperature versus growth in cases]

In the linear regression model, the line of best fit has the form

y = a + bx

where y is the dependent variable (in this case, the growth in cases), x is the independent variable (in this case, average temperature), a is the y-intercept, b is the slope, and n is the number of observations. a and b are calculated as follows:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

Looking at our data again:
Row # | x | y | xy | x²
1 | 6.7 | 0.0190444356640604 | 0.127597718949205 | 44.89
2 | 5.7 | 0.641219305904052 | 3.6549500436531 | 32.49
3 | 17.6 | 4.2230664277931 | 74.3259691291586 | 309.76
4 | 27.6 | 4.02535169073515 | 111.09970666429 | 761.76
5 | 8.7 | 1.49323944158706 | 12.9911831418074 | 75.69
6 | 28.0 | 1.64222773525709 | 45.9823765871985 | 784
7 | 28.7 | 5.21493575760898 | 149.668656243378 | 823.69
8 | 20.0 | 2.12525107771113 | 42.5050215542226 | 400
9 | 15.8 | 4.5485998344997 | 71.8678773850953 | 249.64
10 | 9.0 | 3.52566688296985 | 31.7310019467287 | 81
11 | 11.2 | 5.85222768615169 | 65.5449500848989 | 125.44
12 | 5.1 | 5.30320983746909 | 27.0463701710924 | 26.01
13 | 8.8 | 4.74766002275775 | 41.7794082002682 | 77.44
14 | 5.3 | 5.68005883690249 | 30.1043118355832 | 28.09
15 | 6.9 | 5.3337092601038 | 36.8025938947162 | 47.61
16 | 6.1 | 5.73298162934846 | 34.9711879390256 | 37.21
17 | 5.7 | 5.68168833496091 | 32.3856235092772 | 32.49
18 | 6.8 | 6.27969334507813 | 42.7019147465313 | 46.24
19 | 3.3 | 4.63122772030738 | 15.2830514770144 | 10.89
20 | 0.1 | 5.02036557873883 | 0.502036557873883 | 0.01
21 | 14.9 | 7.07411681619736 | 105.404340561341 | 222.01
22 | 3.5 | 5.76268011590369 | 20.1693804056629 | 12.25
23 | 3.6 | 6.14132030265236 | 22.1087530895485 | 12.96
24 | 15.0 | 4.8895969657192 | 73.343954485788 | 225
25 | −1.3 | 4.72865124275911 | −6.14724661558684 | 1.69
26 | 13.2 | 4.66478589566245 | 61.5751738227443 | 174.24
27 | 0.5 | 4.27666611901605 | 2.13833305950802 | 0.25
28 | −1.0 | 5.39059264265721 | −5.39059264265721 | 1
29 | 5.4 | 5.53733426701854 | 29.9016050419001 | 29.16
30 | 6.4 | 3.86597906692674 | 24.7422660283311 | 40.96
31 | 10.0 | 3.15166707517475 | 31.5166707517475 | 100
32 | 7.0 | 3.36729582998647 | 23.5710708099053 | 49
33 | 6.6 | 3.19184715248028 | 21.0661912063698 | 43.56
34 | 29.5 | 3.07827632253528 | 90.8091515147908 | 870.25
35 | 26.4 | 5.83773044716594 | 154.116083805181 | 696.96
36 | 22.1 | 4.72206393745959 | 104.357613017857 | 488.41
37 | 10.0 | 2.8051555505682 | 28.051555505682 | 100
38 | 17.0 | 5.28927662189589 | 89.9177025722301 | 289
39 | 17.0 | 4.3193910636824 | 73.4296480826008 | 289
40 | 21.2 | 2.07944154167984 | 44.0841606836126 | 449.44
41 | 16.9 | 5.30330490805908 | 89.6258529461984 | 285.61
42 | 16.0 | 3.15207834394469 | 50.433253503115 | 256
43 | 16.6 | 2.49764567556543 | 41.4609182143861 | 275.56
44 | 19.3 | 1.2476478678286 | 24.079603849092 | 372.49
45 | 22.3 | 2.46890630844156 | 55.0566106782468 | 497.29
46 | 25.0 | 2.80336038090653 | 70.0840095226633 | 625
47 | 8.3 | 6.69846069893675 | 55.597223801175 | 68.89
48 | −2.2 | 4.16522864834995 | −9.16350302636989 | 4.84
49 | 21.5 | 7.00351990096652 | 150.57567787078 | 462.25
50 | 14.3 | 5.16383313916824 | 73.8428138901058 | 204.49
51 | 18.1 | 4.30406509320417 | 77.9035781869955 | 327.61
52 | 13.2 | 3.96651119071222 | 52.3579477174013 | 174.24
Σ | 643.4 | 220.669855974774 | 2591.69559117111 | 11643.76

Plugging these sums into the formulas above, with n = 52, we get

y = 4.70952257417939 − 0.0376520327674144x
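
The arithmetic behind this line can be reproduced with a short Python sketch. It assumes only the column sums from the Σ row of the table and n = 52; the variable names are my own.

```python
# Column sums taken from the Σ row of the table above
n = 52
sum_x = 643.4
sum_y = 220.669855974774
sum_xy = 2591.69559117111
sum_x2 = 11643.76

# Both formulas share the same denominator
denom = n * sum_x2 - sum_x ** 2

a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # y-intercept
b = (n * sum_xy - sum_x * sum_y) / denom       # slope

print(f"y = {a:.6f} + ({b:.6f})x")
```

Working from the sums keeps the sketch close to the hand calculation; a statistics library would give the same coefficients from the raw (x, y) pairs.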

The line of best fit looks like:

[Graph of average temperature versus growth in cases]

The slope of the line of best fit is slightly negative. Because y is the logarithm of the ratio of cases, even a slightly negative slope would translate into a big difference in the growth in cases if it were statistically significant (a question we'll investigate below). Over a 30° difference in temperature, the predicted growth changes by a factor of e^(−0.03765 × 30) ≈ 0.32, so a hot country (30°) would show only about one-third the growth of a cold country (0°). Taking the exponential of the line of best fit, we get:

[Graph]
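
To make the "one-third" comparison concrete, here is a small Python sketch that exponentiates the fitted line at 0° and 30°. The coefficients are the ones quoted above; `growth_factor` is a hypothetical helper name.

```python
import math

# Fitted coefficients quoted in the text
a = 4.70952257417939
b = -0.0376520327674144

def growth_factor(temp):
    """Predicted ratio of March 25 cases to March 3 cases at a given temperature."""
    return math.exp(a + b * temp)

cold = growth_factor(0)
hot = growth_factor(30)
ratio = hot / cold  # algebraically equal to exp(30 * b)

print(f"cold: x{cold:.1f}, hot: x{hot:.1f}, hot/cold: {ratio:.3f}")
```

Note that the intercept cancels in the ratio, so the comparison between hot and cold countries depends only on the slope.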

Now to determine whether the result is statistically significant. In the simple linear regression model, a (1 − α) · 100% confidence interval for the slope parameter is given by:

$$b \pm t_{\alpha/2,\,n-2}\sqrt{\frac{MSE}{\sum (x_i - \bar{x})^2}}$$

Typically we would use a 95% confidence interval, so α would be 0.05. If this interval does not contain 0, we would reject the null hypothesis that b (the slope) is zero (meaning there is no relationship). If the interval does contain zero, there is insufficient evidence for rejecting the null hypothesis.

To find the mean squared error (denoted as MSE in the equation above), we'll find the sum of the squares of $(y_i - \hat{y}_i)$, where $\hat{y}_i = a + bx_i$ is the value predicted by the line of best fit, and divide by the number of observations less 2 (= 50).
i | x_i | y_i | ŷ_i | (y_i − ŷ_i)²
1 | 6.7 | 0.0190444356640604 | 4.457253950441 | 19.6977036970566
2 | 5.7 | 0.641219305904052 | 4.494905983211 | 14.8509010068531
3 | 17.6 | 4.2230664277931 | 4.046846793248 | 0.0310533595992085
4 | 27.6 | 4.02535169073515 | 3.670326465548 | 0.126042910519187
5 | 8.7 | 1.49323944158706 | 4.381949884901 | 8.34464802531102
6 | 28.0 | 1.64222773525709 | 3.65526565244 | 4.05232165601611
7 | 28.7 | 5.21493575760898 | 3.628909229501 | 2.51548014786225
8 | 20.0 | 2.12525107771113 | 3.9564819146 | 3.35340637797271
9 | 15.8 | 4.5485998344997 | 4.114620452234 | 0.188338104231718
10 | 9.0 | 3.52566688296985 | 4.37065427507 | 0.714003692808212
11 | 11.2 | 5.85222768615169 | 4.287819802976 | 2.44737202494224
12 | 5.1 | 5.30320983746909 | 4.517497202873 | 0.61734434416393
13 | 8.8 | 4.74766002275775 | 4.378184681624 | 0.136512027705901
14 | 5.3 | 5.68005883690249 | 4.509966796319 | 1.36911538343684
15 | 6.9 | 5.3337092601038 | 4.449723543887 | 0.781430746475328
16 | 6.1 | 5.73298162934846 | 4.479845170103 | 1.57035098549025
17 | 5.7 | 5.68168833496091 | 4.494905983211 | 1.40845235042505
18 | 6.8 | 6.27969334507813 | 4.453488747164 | 3.33502323344271
19 | 3.3 | 4.63122772030738 | 4.585270861859 | 0.00211203283844446
20 | 0.1 | 5.02036557873883 | 4.705757366723 | 0.0989783270677977
21 | 14.9 | 7.07411681619736 | 4.148507281727 | 8.55919114818388
22 | 3.5 | 5.76268011590369 | 4.577740455305 | 1.40408199925974
23 | 3.6 | 6.14132030265236 | 4.573975252028 | 2.45657050771668
24 | 15.0 | 4.8895969657192 | 4.14474207845 | 0.554808803088813
25 | −1.3 | 4.72865124275911 | 4.758470212601 | 0.000889170962431547
26 | 13.2 | 4.66478589566245 | 4.212515737436 | 0.204548296022178
27 | 0.5 | 4.27666611901605 | 4.690696553615 | 0.171421200774195
28 | −1.0 | 5.39059264265721 | 4.74717460277 | 0.4139867740523
29 | 5.4 | 5.53733426701854 | 4.506201593042 | 1.06323459134201
30 | 6.4 | 3.86597906692674 | 4.468549560272 | 0.36309119945035
31 | 10.0 | 3.15166707517475 | 4.3330022423 | 1.39555277708684
32 | 7.0 | 3.36729582998647 | 4.44595834061 | 1.16351281182466
33 | 6.6 | 3.19184715248028 | 4.461019153718 | 1.61079756872576
34 | 29.5 | 3.07827632253528 | 3.598787603285 | 0.270931993387713
35 | 26.4 | 5.83773044716594 | 3.715508904872 | 4.50382427457647
36 | 22.1 | 4.72206393745959 | 3.877412645783 | 0.713435804530933
37 | 10.0 | 2.8051555505682 | 4.3330022423 | 2.33431551343581
38 | 17.0 | 5.28927662189589 | 4.06943801291 | 1.48800623197263
39 | 17.0 | 4.3193910636824 | 4.06943801291 | 0.06247652759043
40 | 21.2 | 2.07944154167984 | 3.911299475276 | 3.35570348887919
41 | 16.9 | 5.30330490805908 | 4.073203216187 | 1.51315017234655
42 | 16.0 | 3.15207834394469 | 4.10709004568 | 0.912047350451372
43 | 16.6 | 2.49764567556543 | 4.084498826018 | 2.51810292110125
44 | 19.3 | 1.2476478678286 | 3.982838337539 | 7.4812669055946
45 | 22.3 | 2.46890630844156 | 3.869882239229 | 1.96273355864573
46 | 25.0 | 2.80336038090653 | 3.76822175075 | 0.930957463016217
47 | 8.3 | 6.69846069893675 | 4.397010698009 | 5.29667210677034
48 | −2.2 | 4.16522864834995 | 4.792357042094 | 0.393290022239993
49 | 21.5 | 7.00351990096652 | 3.900003865445 | 9.63181178273921
50 | 14.3 | 5.16383313916824 | 4.171098501389 | 0.98552206104668
51 | 18.1 | 4.30406509320417 | 4.028020776863 | 0.0762004645842644
52 | 13.2 | 3.96651119071222 | 4.212515737436 | 0.0605182370087725
Total | | | | 129.493244162627
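
Each row of this table comes from the same two steps: evaluate the fitted line at x_i, then square the difference from y_i. A minimal Python sketch for the first three rows follows; the coefficients are those of the line of best fit found earlier, and `squared_residual` is an illustrative name. Small differences in the last few digits relative to the table are to be expected from rounding.

```python
# Coefficients of the line of best fit, y = a + bx
a = 4.70952257417939
b = -0.0376520327674144

# (x_i, y_i) for the first three rows of the table
points = [
    (6.7, 0.0190444356640604),
    (5.7, 0.641219305904052),
    (17.6, 4.2230664277931),
]

def squared_residual(x, y):
    """Return the fitted value a + b*x and the squared residual (y - fitted)^2."""
    y_hat = a + b * x
    return y_hat, (y - y_hat) ** 2

results = [squared_residual(x, y) for x, y in points]

for (x, y), (y_hat, sq) in zip(points, results):
    print(f"x={x}: fitted={y_hat:.6f}, squared residual={sq:.6f}")
```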

Dividing 129.493244162627 by 50 gives 2.58986488325253; this is the MSE in the formula above. We also need $\sum (x_i - \bar{x})^2$, which follows from the sums found earlier: Σx² − (Σx)²/n = 11643.76 − 643.4²/52 ≈ 3682.9223. Finally, we need $t_{0.025,50}$. Consulting a table, $t_{0.025,40} = 2.021$ and $t_{0.025,60} = 2.000$. Since $t_{0.025,50}$ falls between those two values, we'll use 2.01. So, a 95% confidence interval for the slope parameter is

$$-0.037652 \pm (2.01)\sqrt{\frac{2.58986}{3682.9223}} = -0.037652 \pm 0.0533$$

Since this range (roughly −0.091 to 0.016) includes zero, we cannot conclude, at the 95% confidence level, that the slope is not zero, and so the nonzero slope that we found is not statistically significant. That doesn't necessarily mean that, in the real world, there is no relationship whatsoever between temperature and spread of the virus, but it does mean that, if there is one, our data don't provide statistically significant evidence of such a relationship.
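
The interval calculation can be sketched in a few lines of Python. This assumes the totals from the tables above and the hand-interpolated critical value of 2.01 rather than an exact t quantile; the variable names are my own.

```python
import math

n = 52
sse = 129.493244162627       # total of the squared residuals
sum_x = 643.4                # Σx
sum_x2 = 11643.76            # Σx²
b = -0.0376520327674144      # fitted slope
t_crit = 2.01                # interpolated table value for t(0.025, 50)

mse = sse / (n - 2)                    # mean squared error
sxx = sum_x2 - sum_x ** 2 / n          # Σ(x_i − x̄)², via the shortcut identity
half_width = t_crit * math.sqrt(mse / sxx)

lower, upper = b - half_width, b + half_width
contains_zero = lower <= 0 <= upper

print(f"95% CI for slope: ({lower:.4f}, {upper:.4f}); contains zero: {contains_zero}")
```

An exact quantile from a statistics package would differ from 2.01 only slightly and would not change the conclusion here.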


I had also collected these Coronavirus statistics by country but ended up not using them (for this, at any rate):

March 22 | March 21 | March 20 | March 19 | March 18