Using generalized logistics regression to forecast
population infected by Covid-19
Villalobos Arias, Mario Alberto∗
April 5, 2020
Abstract
In this work, a proposal to forecast the populations using generalized logistics regres-
sion curve tting is presented. This type of curve is used to study population growth,
in this case population of people infected with the Covid-19 virus; and it can also be
used to approximate the survival curve used in actuarial and similar studies.
Keywords: Optimization heuristics, generalized logistic regression, curve tting, covid-19
1 Introduction
Population growth curves follow the well-known logistic behavior as shown in the gure 1.
Figure 1: casos totales de contagiados en China
∗Universidad de Costa Rica, CIMPA y Escuela de Matemática, San José, Costa Rica,
mario.villalobos@ucr.ac.cr
Instituto tecnologico de Costa Rica, Escuela de Matemática, Cartago, Costa Rica, marvillalobos@itcr.ac.cr
1
A model that is used to t population curves is the Logistic one, which uses the following
equation
1
P (t) = − (1.1)1 + e at+b
As you can see this model has several problems among them that the data is in [0, 1], and
it is not exible, the advantage is that this model is that an approximation of the optimal
solution can be obtained by transforming the data and use linear regression.
As seen in the graph 2 applying logistic regression and dividing all the data by the maximum
value (66818) a curve that ts well is obtained, with a R2 = 0.99955, which is very good ,
from the statistical point of view, but as seen in the gure there are many values that do
not t very well in the curves and it is not good for prediction, as we will see later.
Figure 2: t with logistic regression, total of infected in China
2 SIR model
The classic version for the study of epidemics is the SIR model in which the population
is divided into three groups: the susceptible, the infected and the recovered (SIR), this in
the simplest case and that the population is changing from susceptible infected and then
recovered.
The decrease in susceptibles is assumed to be proportional to the number of infected multi-
plied by the same number of susceptibles. The change in the recovered is equal to a certain
percentage of the infected and nally the number of infected is going to change increasing
by the susceptibles that are infected and then we take away the amount of the infected that
is recovered. In this way the following equations are obtained:
2
dS −βIS= ,
dt N
dI βIS
= − γI,
dt N
dR
= γI,
dt
with
dS dI dR
+ + = 0,
dt dt dt
which gives
S(t) + I(t) +R(t) = constant = N,
The problem with the SIR model is that, as we see, the three parameters that are in the
dierential equations are needed, but with few data or with a few days it is very dicult to
determine those parameters.
3 Generalized Logistic Regression
Therefore, the use of a slightly more complex model to the logistic regression and that is
easier to determine the parameters than the SIR are proposed.
A rst version is:
M
P (t) = − (3.1)1 + e at+b
In this one more parameter is added, to be determined, which is the population limit M ,
since with this modication it cannot be solved by transformation and linear regression. To
solve this problem, nonlinear optimization techniques must be used, it also has the stiness
problem, that is, it does not t suciently to certain parts of the curve, this model can NOT
be used for prediction, but from the point of adjustment it does not it is as accurate as we
will see later.
To make the curve more exible, an extra parameter α is added as follows:
M
P (t) = ( ) . (3.2)
1 + e−at+b
α
This parameter α adds exibility in tting the curve, remember the graphs of y = x1/3,
y = x1/2, y = x,y = x2,y = x3.
For this function the inection point is obtained when
P ′′(t) = a2ceat+b(eat+b + 1)−c−2(ceat+b − 1) = 0
so the inection point is obtained for
− ln(α) + bt =
a
3
This is the same model known as the Richards curve that is used to model population growth.
K − A
Y (t) = A+ (3.3)
(C +Qe−Bt)1/ν
Note that with some calculations and A = 0 equality with generalized logistics is given
The case:
K
Y (t) =
(1 +Qe−αν(t−t0))1/ν
which is a solution of the dierential equat(ion: ( )ν)
Y ′
Y
(t) = α 1− Y
K
3.1 Gompertz Function
It owes its name to Benjamin Gompertz, the rst to work in this type of function is a
particular case of Richards, and has the following equation:
−ct
G(t) = ae−be ,
Furthermore, its second derivative is:
G′′
cx
(t) = bc2ebe +cx(1 + becx) = 0
which gives us the tipping point is reached in t = −log(−b))/c what in the case of epidemics
tells us at what point the growth of daily cases will start to decrease.
This is a simpler function since it only has 3 parameters instead of the LG, which has 4, and
therefore it will have less local optimums.
3.2 Other versions
Other more complex versions are:
M + ct
P (t) = ( )α . (3.4)
1 + e−at+b
At the time of this writing, the South Korea covid-19 data has to be tted with a curve like
this.
And the next one that was used to adjust the survival curves
P (t) = ( M ) , (3.5)
1 + e−at2+bt+c
α
4
4 Hypothesis and proposal
The proposal of this work is, rst, to use the generalized logistic curve or the Gompertz
curve to make an adjustment of the data in which the curve is almost complete, for example
data from China or South Korea, (March, 30)
On the other hand, when all the data are available, there is a generalized logistic regression
or Gompertz type curve that ts the data.
So the hypothesis here is: If we have the lower part of the curve, that is, the rst values of
the curve, we can obtain the parameters of the curve, and obtain the complete curve. And
with this it can be used to predict population growth, in this case, for example, the total
number of cases by covid-19 in a country or region and when the inection point is reached,
that is, when that the number of daily cases begins to decrease.
More specically, if you have the data for about 20, 30 or 35 days, the question is:
Can we determine the curve parameters that t the complete data?
If this will be achieved as we see, we would have a way of predicting the behavior of population
growth with just having a few days, that is, we can determine the parameters of the curve
with a few days, which is very dicult with the SIR model to determine the parameters of
the dierential equations.
5 Curve Fitting
For this work, the data provided by the European Centre for Disease Prevention and Con-
trol are being used, on the website to download the daily data, see [1].
The rst thing that is presented is to verify, as is known, that the LG curve ts the data
from covid-19 very well. In the case of China and South Korea, which are the ones that have
the almost complete curve.
Using a nonlinear optimization algorithm, the following results are obtained.
5.1 China: original data
For the China data, the following parameters are obtained, and the graph in gure 3.
fecha M a b α
31/03/2020 81149 −0.2348563 9.89996092 0.8809852
Con R2 = 0.998878. As it is observed, a very good approximation is obtained, but as we
know and it is seen in the gure ref ChinaOri, there is a jump in those data, so it was decided
to correct that jump by putting the data of the day 02/13/2020 equal to the previous day
and that of 02/14/2020 equal to the following day. With these corrections, the following
data and the graph in gure 4 are obtained.
fecha M a b α
31/03/2020 66871 −0.15032863 4.21103499 4.436336
In this case, a R2 = 0.999885 is obtained, which is better than the data without correction.
5
Total cases China
90,000
80,000
70,000
60,000
50,000
40,000
30,000
20,000
10,000
0
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Figure 3: ting with logistic regression, total of infected in China
Note that the latest data shows that there is a linear trend, this can be improved as seen in
the case of South Korea, as seen in the following subsection.
Total cases China
80,000
70,000
60,000
50,000
40,000
30,000
20,000
10,000
0
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Figure 4: Fit with generalized logistic regression, China corrected data
5.2 South Korea: original data
Applying the method to the data for South Korea gives the following data and the graph in
gure 5.
fecha M a b α
31/03/2020 10079 −0.16416438 0.18353146 874.692503
And we get a R2 = 0.99875, which is not very good as seen in gure 5 that the t is not
very good.
To improve this t we propose to use the t curve (3.4), that is:
( M + ctP (t) = ) .
1 + e−at+b
α
6
casos South Korea
12000
10000
cases acum
8000
ajuste de curva
6000
4000
2000
0
0 10 20 30 40 50 60 70 80
Figure 5: Fit with GLR, South Korea data
This curve improves the t by obtaining the following data:
fecha c M a b α
31/03/2020 92.63407116 3046 −0.36922771 15.5617475 0.9691940884
And now the R2 = 0.99988 which is better than the previous result and as seen in the graph
in gure 6.
cases South Korea
10000
9000
8000
7000
6000 cases
acum
5000 ajuste de
4000 curva
3000
2000
1000
0
20 30 40 50 60 70
Figure 6: t with GLR, infected in South Korea with linear nal trend.
As seen, the t with this type of generalized logistic curves is very good, obtaining R2 greater
than 0.9999.
7
6 Prediction tests
As we saw in the previous section, the logistic curves approximate very well, as has been
shown in previous works (for example [2, 4]).
In this section we will try to measure the relative error percentage that is made when using
the proposed curves, a test curve will be used and the data from China and South Korea
Test curve
To try to validate the hypothesis we are going to generate a curve with values
M a b α
2000 −0.05 −2 10
and when using the optimization algorithm, the following upper bounds are obtained for the
relative error When using the LG function, with the indicated number of days:
Relative error percentage in test curve using GLR
number of days max rel. error
25 10.451%
30 4.801%
35 1.825%
40 1.188%
Using the Gompertz function improves the prediction, as mentioned by having fewer param-
eters, when using the example with
a b c
2000 −2 10
When executing the method the following relative error percentages are obtained
Relative error percentage on test curve
using Gompertz
number of days max rel. error
25 1.6697%
30 0.7492%
35 0.8928%
40 0.3234%
As seen with the Gompertz function, it seems that better predictions can be guaranteed.
8
China Data
Let us now look at the case of China, where the curve is almost complete.
As we see when using the days from 16 to 45 (20 days), note that the rst 16 days there
were almost no changes in the data, so that data was not used, with these data with GLR
we obtain:
Gompertz
When using the days from 16 to 45 (20 days) a maximum error of 9.91 % is obtained (average
of 3 runs, basically gives the same value)
When using the days from 15 to 55 (30 days), the graph of gure 7 is obtained. When
measuring the relative errors, a maximum value of 4.49 % (average of 3 runs) is obtained, in
the prediction of the data until 03/31/2020.
Relative error percentage prediction China
using Gompertz
number of days max rel. error
20 9.91%
25 4.49%
cases China
80000
70000
60000 Datos reales
APROX
50000
40000
30000
20000
10000
0
0 10 20 30 40 50 60 70 80 90 100
-10000
Figure 7: 30 day forecast with Gompertz, China
9
South Korea Data
For South Korea we will use the data from day 28, since the rst case (16/02/202), in this
case remember that this data has a linear trend in last days, see gure 8 .
Relative error percentage in South Korea
prediction using Gompertz
number of days máx error rel.
20 17.83%
25 6.49%
12000 cases South Korea
10000
Datos reales
8000 APROX
6000
4000
2000
0
20 30 40 50 60 70
Figure 8: 25 days forecast with Gompertz, South Korea,
7 Results
As results of this work and as we have seen, this method will be used to give a prediction
for the data of some countries starting with Costa Rica and then the data for Italy, Spain
will be presented, and with the adjustments that we have, we already have that of China.
and South Korea.
7.1 China and South Korea
In the case of China and South Korea, the results obtained in the subsections 5.1 and 5.2
can be used.
10
7.2 Costa Rica
For Costa Rica, the following data is available from March 6, when the rst case was detected.
Covid-19 data in Costa Rica
daily Total daily Total daily Total
Date cases cases Date cases cases Date cases cases
06/03/20 1 1 15/03/20 8 35 24/03/20 19 177
07/03/20 4 5 16/03/20 6 41 25/03/20 24 201
08/03/20 4 9 17/03/20 9 50 26/03/20 30 231
09/03/20 4 13 18/03/20 19 69 27/03/20 32 263
10/03/20 4 17 19/03/20 18 87 28/03/20 32 295
11/03/20 5 22 20/03/20 26 113 29/03/20 19 314
12/03/20 1 23 21/03/20 4 117 30/03/20 16 330
13/03/20 3 26 22/03/20 17 134 31/03/20 17 347
14/03/20 1 27 23/03/20 24 158 01/04/20 28 375
06/03/20 1 1 15/03/20 8 35 24/03/20 19 177
07/03/20 4 5 16/03/20 6 41 25/03/20 24 201
08/03/20 4 9 17/03/20 9 50 26/03/20 30 231
09/03/20 4 13 18/03/20 19 69 27/03/20 32 263
10/03/20 4 17 19/03/20 18 87 28/03/20 32 295
11/03/20 5 22 20/03/20 26 113 29/03/20 19 314
12/03/20 1 23 21/03/20 4 117 30/03/20 16 330
13/03/20 3 26 22/03/20 17 134 31/03/20 17 347
14/03/20 1 27 23/03/20 24 158 01/04/20 28 375
With these data when making the adjustment with the GL function, the following parameters
are obtained, and with this amount of data, basically there is only one optimum:
Parameters obtained for Costa Rica
using Generalized Logistics
M a b c
886 -0.0780135 -4.91608521 1023.90189
This curve t has a R2 = 0.99858343.
In the graph in gure 9 you can see the prediction of cases, you can see the number of daily
cases in the upper box and the adjustment of the real data in the lower box.
Upper and lower bounds are included for the forecast, with an error of 7 % in the data, base
on the results in the section 6.
It is also seen that if the situations are as they are today there will be a limit of approximately
1000 cases, that is, from the mathematical point of view or curve tting.
Gompertz
With the Gompertz curve, basically the same result is obtained as seen in the following:
11
Figure 9: Forecast with LG, Costa Rica
Parameters obtained for Costa Rica
using Gompertz
a b c
887 -6.923348908 -0.077891632
This t has a R2 = 0.99853504, in the gure 10 you can see the resulting graph.
It should be noted that the results of using these 2 functions have been approximating each
other as the days go by and the maximum value between them is presented in the following
table for the sample.
Furthermore, it can be seen that as the days go by, the limit value of M decreases, possibly
due to the measures taken by the country's Government.
12
Figure 10: Forecast with Gompertz, Costa Rica
Limit value of M of the forecast according
to day, Costa Rica
Date Gompertz LG
23/03/20 2162 1983
24/03/20 1445 1359
25/03/20 1310 1274
26/03/20 1583 1505
27/03/20 2048 1974
28/03/20 2426 2343
29/03/20 1743 1730
30/03/20 1193 1185
31/03/20 924 922
01/04/20 887 886
02/04/20 829 828
03/04/20 788 789
04/04/20 761 761
05/04/20 744 743
Furthermore, it can be seen that as the days go by, the limit value of M decreases, possibly
due to the measures taken by the country's Government, in the gure 11 this is show.
13
2500 Datos reales
Evolution of aprox Ult APROX
2250 Costa Rica 23-mar
24-mar
2000 25-mar
26-mar
1750 27-mar
28-mar
1500 29-mar
30-mar
1250 31-mar
01-abr
1000 02-abr
03-abr
750
04-abr
500
250
0
0 10 20 30 40 50 60 70 80 90 100
Figure 11: Costa Rica, forecast evolution
14
7.3 Italy
For Italy, the results obtained using Gompertz and with the data as of 3/30/2020 are pre-
sented.
Parameters obtained for Italy
using Gompertz
a b c
261 052 -43.77482253 -0.063450536
This t has a R2 = 0.99968, in the gure 12 you can see the resulting graph.
Cases - Italia
300000
250000
Datos Reales
Aprox
200000
inferior
150000 superior
Pto de inflex
100000
50000
0
0 20 40 60 80 100 120 140
-50000
Figure 12: Italy, forecast with Gompertz,
15
7.4 Spain
For Spain, the results are presented when using the Gompertz function and with the data
as of 3/31/2020.
Parameters obtained for Spain
using Gompertz
a b c
468 495 -69.20043418 -0.061995389
This t has a R2 = 0.999410067, in the gure 13 you can see the resulting graph.
España-cases,  Gompertz
550000
500000
Datos reales
450000
400000 APROX
350000 inf
300000 sup
250000
200000
150000
100000
50000
0
20 40 60 80 100 120 140
Figure 13: Spain, Forecast with Gompertz.
16
7.5 USA
For USA, the results are presented when using the Gompertz function and with the data as
of april 5,2020.
Parameters obtained for USA
using Gompertz
a b c
1 486 347 -253.23772 -0.066899181
This t has a R2 = 0.999915, in the gure 14 you can see the resulting graph.
Figure 14: USA, Forecast with Gompertz.
7.6 Other countries
There are the results of other countries that will be incorporated later
8 Conclusions and future work
This work shows how population growth curves can be adjusted, using the LG and Gompertz
functions, even in the most general case, such as the data for South Korea. section 5.2, where
logistics is being carried out with a straight line.
It is seen that these methods could be used to predict the growth of populations, in this case
of people infected with Covid-19, and could help pandemic experts to take the necessary
measures.
17
8.1 Future work
More studies are needed to rene the results of this work.
See the possibility of using similar curves to approximate other types of data.
References
[1] Download today's data on the geographic distribution of COVID19
cases worldwide, European Centre for Disease Prevention and Control,
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-
distribution-covid-19-cases-worldwide
[2] Fekedulegn, Desta; Mairitin P. Mac Siurtain; Jim J. Colbert (1999). Parameter Estima-
tion of Nonlinear Growth Models in Forestry (PDF). Silva Fennica. 33 (4): 327-336.
Archived from the original (PDF) on 2011-09-29.
[3] Kirckpatrick, S.; Gelatt, J. Vecchi, P. 1983. Optimization by simulated annealing,
Revista Science. Volumen 220, número 4598.
[4] Pella, J. S.; Tomlinson, P. K. (1969). A Generalised Stock-Production Model. Bull.
Inter-Am. Trop. Tuna Comm. 13: 421-496.
[5] Richards, F. J. (1959). A Flexible Growth Function for Empirical Use. Journal of
Experimental Botany. 10 (2): 290-300. doi:10.1093/jxb/10.2.290
18