Let's consider the following SIRD Model
where

- $S(t)$ is the number of susceptible people,
- $I(t)$ is the number of infected,
- $R(t)$ is the number of recovered, and
- $D(t)$ is the number of deceased.
Closely related, we define

- $\alpha(t)$ as the rate of infection,
- $\beta(t)$ as the rate of recovery, and
- $\gamma(t)$ as the rate of mortality.
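
For reference, one common way of writing a SIRD system with time-dependent rates in these symbols is sketched below. The normalisation by a total population $N = S + I + R + D$ is an assumption made here for illustration and is not necessarily the exact formulation used by the authors.

$$
\begin{aligned}
S'(t) &= -\alpha(t)\,\frac{S(t)\,I(t)}{N}, \\
I'(t) &= \alpha(t)\,\frac{S(t)\,I(t)}{N} - \beta(t)\,I(t) - \gamma(t)\,I(t), \\
R'(t) &= \beta(t)\,I(t), \\
D'(t) &= \gamma(t)\,I(t).
\end{aligned}
$$

In this form the four right-hand sides sum to zero, so the total population stays constant.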
Another important measure is the basic reproduction number:
The parameter
A practical way to model the behavior of COVID-19 is to use time series, as in Maleki et al. (2020), but this approach neglects the underlying mathematical models. Andrade et al. (2021) attempt to identify an underlying model before fitting a time series to forecast deaths, but it is not the one given by the SIR model. In recent years, methods from machine learning and deep learning have been used to obtain more accurate time series predictions: for example, Singh et al. (2020) use support vector machines, and Hawas (2020) uses recurrent neural networks. These approaches also overlook the mathematical models.
The main problem, however, seems to be that classical models such as the one given above are too rigid for the current scenario, so some of their assumptions need to be reformulated. As shown in Wacker and Schlüter (2020), under certain technical assumptions an analytical solution can be obtained without assuming that the parameters of the SIR model are constant. However, no method is given there for modeling these time-varying parameters.
In the present article, we inspect a promising solution that builds on the above ideas: modeling the time-dependent parameters of the SIR model as time series. To make this approach accessible to as many people as possible, we employ tools from machine learning to produce accurate predictions of the pandemic. To illustrate the idea, we use data from the Our World in Data project.
We will analyze the following discrete generalization of the SIR model.
Define
For the sake of simplicity, we consider a fixed total population
From here, we denote the first (backward) difference
From the discrete model above, it follows that
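The main idea is to obtain the time series for the rates $\alpha(t)$, $\beta(t)$, and $\gamma(t)$. This approach is implemented in the Python package `epydemics`, which is publicly available on GitHub and can be installed from PyPI.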
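
The idea of reading the rates off the data can be sketched with NumPy. The discrete SIRD update used here (new infections $\alpha S I / N$, recoveries $\beta I$, deaths $\gamma I$ per step, with $C = I + R + D$) and the forward indexing of the differences are assumptions made for illustration. The function names `simulate_sird` and `estimate_rates` are hypothetical and are not part of `epydemics`.

```python
import numpy as np


def simulate_sird(initial, alpha, beta, gamma):
    """Iterate an assumed discrete SIRD model with one rate value per step."""
    S0, I0, R_init, D0 = (float(x) for x in initial)
    N = S0 + I0 + R_init + D0  # fixed total population
    S, I, R, D = [S0], [I0], [R_init], [D0]
    for a, b, g in zip(alpha, beta, gamma):
        new_infected = a * S[-1] * I[-1] / N
        new_recovered = b * I[-1]
        new_deceased = g * I[-1]
        S.append(S[-1] - new_infected)
        I.append(I[-1] + new_infected - new_recovered - new_deceased)
        R.append(R[-1] + new_recovered)
        D.append(D[-1] + new_deceased)
    return tuple(np.array(x) for x in (S, I, R, D))


def estimate_rates(S, I, R, D):
    """Recover alpha, beta, gamma from compartment series by differencing."""
    N = S[0] + I[0] + R[0] + D[0]
    C = I + R + D  # cumulative confirmed cases
    alpha = np.diff(C) * N / (S[:-1] * I[:-1])
    beta = np.diff(R) / I[:-1]
    gamma = np.diff(D) / I[:-1]
    return alpha, beta, gamma


steps = 50
true_alpha = np.linspace(0.30, 0.20, steps)  # slowly decreasing infection rate
true_beta = np.full(steps, 0.10)
true_gamma = np.full(steps, 0.01)

# Synthetic compartments generated from known rates, then inverted again.
S, I, R, D = simulate_sird((9990, 10, 0, 0), true_alpha, true_beta, true_gamma)
alpha, beta, gamma = estimate_rates(S, I, R, D)
```

On synthetic data generated by the same update rule, the estimated rates reproduce the rates used in the simulation, which is the property that makes it possible to treat them as observable time series.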
We use the data from the Our World in Data project, available in the `data_sample` folder. The raw data is processed with the `process_data_from_owid` function, and the result is wrapped in a `DataContainer` object, which holds the data together with information about it. The `DataContainer` object is then used to create a `Model` object, which is used to create and fit the model, forecast, run simulations, generate results, evaluate the forecast, and visualize the results.
# !pip install epydemics
import matplotlib.pyplot as plt
from epydemics import process_data_from_owid, DataContainer, Model
To keep the exposition clear, the `warnings` module is used to suppress warnings.
import warnings
warnings.filterwarnings('ignore')
First, we retrieve the global data from the `owid-covid-data.csv` file using the `process_data_from_owid` function. If no argument is passed, the function retrieves the data from the `owid-covid-data.csv` file. The object `global_dataframe` is just a Pandas DataFrame containing the raw data from that file.

Other sources can be used as long as they have the same structure as the `owid-covid-data.csv` file. By default, the retrieved data is filtered to keep only global data by setting the parameter `iso_code` to `OWID_WRL`. The `iso_code` parameter can also be used to filter the data by country; for example, `iso_code="MEX"` retrieves the data for Mexico.
global_dataframe = process_data_from_owid()
global_dataframe.head()
date | C | D | N |
---|---|---|---|
2020-01-05 | 2.0 | 3.0 | 7975105024 |
2020-01-06 | 2.0 | 3.0 | 7975105024 |
2020-01-07 | 2.0 | 3.0 | 7975105024 |
2020-01-08 | 2.0 | 3.0 | 7975105024 |
2020-01-09 | 2.0 | 3.0 | 7975105024 |
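
The filtering by `iso_code` described above can be pictured with plain pandas. This is only a sketch: the helper `select_region` is hypothetical, the tiny input frame contains made-up values, and the assumption that the columns `C`, `D`, and `N` come from the OWID columns `total_cases`, `total_deaths`, and `population` is ours rather than something stated by the package.

```python
import pandas as pd

# Tiny stand-in for owid-covid-data.csv; all values are made up for illustration.
raw = pd.DataFrame(
    {
        "iso_code": ["OWID_WRL", "OWID_WRL", "MEX", "MEX"],
        "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"],
        "total_cases": [100.0, 120.0, 5.0, 7.0],
        "total_deaths": [3.0, 4.0, 0.0, 1.0],
        "population": [8000, 8000, 1000, 1000],
    }
)


def select_region(frame, iso_code="OWID_WRL"):
    """Keep one region, index it by date, and use the short column labels."""
    subset = frame.loc[frame["iso_code"] == iso_code].copy()
    subset["date"] = pd.to_datetime(subset["date"])
    subset = subset.set_index("date").sort_index()
    renamed = subset.rename(
        columns={"total_cases": "C", "total_deaths": "D", "population": "N"}
    )
    return renamed[["C", "D", "N"]]


world = select_region(raw)          # default: global aggregate
mexico = select_region(raw, "MEX")  # a single country
```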
Using `global_dataframe`, we create a `DataContainer` object. As soon as the raw data is received by `DataContainer`, it is processed; the resulting object contains the data and the information about the data, and it is later used to create a `Model` object.
global_data_container = DataContainer(
    global_dataframe
)
print(
    f"Global data container has {global_data_container.data.shape[0]} rows and {global_data_container.data.shape[1]} columns.")
print(f"Global data container has {global_data_container.data.isna().sum().sum()} missing values.")
Global data container has 1677 rows and 20 columns.
Global data container has 0 missing values.
The `data` attribute of a `DataContainer` object is just a Pandas DataFrame containing the processed data. Because of this, we can use the Pandas DataFrame methods to visualize the data.
global_data_container.data[["C", "D", "N"]].plot(
subplots=True
)
plt.show()
The dictionary containing the meaning of every label can be retrieved from the `compartment_labels` attribute of the module itself.
from epydemics import compartment_labels
compartment_labels
{'A': 'Active',
'C': 'Confirmed',
'S': 'Susceptible',
'I': 'Infected',
'R': 'Recovered',
'D': 'Deaths'}
global_data_container.data[["A", "S", "I", "R"]].plot(
subplots=True
)
plt.show()
As stated in the introduction, the time-dependent (rather than constant) nature of the rates is the core of this model.
global_data_container.data[["alpha", "beta", "gamma"]].plot(
subplots=True
)
plt.show()
We create a model from the `global_data_container` object, using information from March 1, 2020, to December 31, 2020.
global_model = Model(
    global_data_container,
    start="2020-03-01",
    stop="2020-12-31",
)
In the following, we apply these methods to create and fit a time series model for the logit of the rates.
global_model.create_logit_ratios_model()
global_model.fit_logit_ratios_model()
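
Working with the logit of the rates can be motivated with a short NumPy sketch. It assumes the rates lie strictly between 0 and 1, so that any forecast made in logit space maps back into that interval. The naive linear extrapolation below is only a placeholder for a real time series model and says nothing about what `create_logit_ratios_model` does internally.

```python
import numpy as np


def logit(p):
    """Map a rate in (0, 1) to the whole real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit: map any real number back into (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))


rates = np.array([0.02, 0.05, 0.10])  # hypothetical daily rate values
z = logit(rates)

# Placeholder forecast in logit space: repeat the last observed increment.
z_next = z[-1] + (z[-1] - z[-2])
rate_next = float(expit(z_next))  # back-transformed forecast
```

Whatever value the model produces in logit space, the back-transformed rate cannot leave the unit interval, which would not be guaranteed if the rates were forecast directly.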
Now that we have a model for these rates, we can adjust the number of days (`steps`) to forecast. The `forecast_logit_ratios` method returns a Pandas DataFrame containing the forecasted logit ratios. The `forecasting_interval` attribute contains the forecasting interval.
global_model.forecast_logit_ratios(steps=30)
global_model.forecasting_interval
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
'2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
'2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
'2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
'2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
'2021-01-29', '2021-01-30'],
dtype='datetime64[ns]', freq='D')
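
The interval shown above is consistent with a daily index that starts the day after the `stop` date of the model and has `steps` entries. A minimal pandas sketch that rebuilds such an index, independently of how `epydemics` constructs it, is:

```python
import pandas as pd

stop = pd.Timestamp("2020-12-31")  # last day of the training window
steps = 30                         # number of days to forecast

# Daily index beginning on the first day after the training window.
interval = pd.date_range(start=stop + pd.Timedelta(days=1), periods=steps, freq="D")
```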
Run the simulations and generate the results. The `generate_result` method returns a Pandas DataFrame, `global_model.results`, containing the results.
global_model.run_simulations()
global_model.generate_result()
Finally, we can visualize the results. The `visualize_results` method returns a Matplotlib Figure object. First, we create a testing dataset from the global data container and the forecasting interval of the global model; `global_testing_data` is a Pandas DataFrame containing the testing data.
global_testing_data = global_data_container.data.loc[global_model.forecasting_interval]
for compartment in ["C", "D", "I"]:
    global_model.visualize_results(
        compartment,
        global_testing_data,
        log_response=True)
The gray dotted lines are several forecasts that depend on the confidence interval of the time series model for the logit of the rates.
A peculiar feature of this model is that the forecast is not a single value but a distribution. For example, although the averages of the forecasted deaths are not very close to the actual data, the lower forecast series are very close to it.
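
One way to picture how a family of forecast curves can arise is to combine lower, point, and upper forecasts of each rate and simulate every combination. The sketch below does this with an assumed discrete SIRD update and hypothetical constant rate levels. It is an illustration of the idea only and is not taken from the internals of `run_simulations`.

```python
from itertools import product


def simulate_deaths(initial, alpha, beta, gamma, steps):
    """Cumulative deaths from an assumed discrete SIRD model with constant rates."""
    S, I, R, D = (float(x) for x in initial)
    N = S + I + R + D
    deaths = []
    for _ in range(steps):
        # All flows are computed from the current state before updating it.
        new_infected = alpha * S * I / N
        new_recovered = beta * I
        new_deceased = gamma * I
        S -= new_infected
        I += new_infected - new_recovered - new_deceased
        R += new_recovered
        D += new_deceased
        deaths.append(D)
    return deaths


# Hypothetical lower, point, and upper forecasts for each rate.
alpha_levels = {"lower": 0.15, "point": 0.20, "upper": 0.25}
beta_levels = {"lower": 0.08, "point": 0.10, "upper": 0.12}
gamma_levels = {"lower": 0.005, "point": 0.010, "upper": 0.015}

scenarios = {}
for (a_name, a), (b_name, b), (g_name, g) in product(
    alpha_levels.items(), beta_levels.items(), gamma_levels.items()
):
    scenarios[(a_name, b_name, g_name)] = simulate_deaths(
        (9900, 100, 0, 0), a, b, g, steps=30
    )

# Spread of the final cumulative deaths across all 3 x 3 x 3 scenarios.
final_deaths = sorted(path[-1] for path in scenarios.values())
```

Each scenario yields one curve, so the forecast for a compartment is naturally a bundle of paths rather than a single line, and the actual data can fall near one of the outer paths even when it is far from the average.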
A tool for evaluating the forecast more rigorously, using several criteria, is also provided, and the evaluation can be saved for further analysis.
import json
evaluation = global_model.evaluate_forecast(global_testing_data, save_evaluation=True, filename="global_evaluation")
for category, info in evaluation.items():
    print(category, info['mean']['smape'])
C 2.197310076754546
D 49.4142982150385
I 15.371809526744256
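
The `smape` values printed above refer to the symmetric mean absolute percentage error. Several variants of this metric exist, and the text does not state which one `evaluate_forecast` uses, so the following is only a sketch of one common definition, expressed as a percentage.

```python
import numpy as np


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (one common variant)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = (np.abs(actual) + np.abs(forecast)) / 2.0
    # Points where both values are zero count as a perfect match.
    ratio = np.divide(
        np.abs(forecast - actual),
        denominator,
        out=np.zeros_like(denominator),
        where=denominator != 0,
    )
    return 100.0 * ratio.mean()


error = smape([100.0, 200.0], [110.0, 180.0])
```

Under this definition a value near 0 means the forecast tracks the data closely, and the metric is bounded above by 200.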
Since this is a very new model, there are many things to do. For example, we could try other time series models for the logit of the rates.
Allen et al. 2008: Allen, L.J.S.; Brauer, F.; Driessche, P. van den; Bauch, C.T.; Wu, J.; Castillo-Chavez, C.; Earn, D.; Feng, Z.; Lewis, M.A.; Li, J. et al.: Mathematical Epidemiology. Springer Berlin Heidelberg, 2008 (Lecture Notes in Mathematics). URL https://books.google.com/books?id=gcP5l1a22rQC. ISBN 9783540789109
Andrade et al. 2021: Andrade, Marinho G.; Achcar, Jorge A.; Conceição, Katiane S.; Ravishanker, Nalini: Time Series Regression Models for COVID-19 Deaths. In: J. Data Sci 19 (2021), No. 2, pp. 269–292
Hawas 2020: Hawas, Mohamed: Generated time-series prediction data of COVID-19's daily infections in Brazil by using recurrent neural networks. In: Data in brief 32 (2020), p. 106175
Maleki et al. 2020: Maleki, Mohsen; Mahmoudi, Mohammad R.; Wraith, Darren; Pho, Kim-Hung: Time series modelling to forecast the confirmed and recovered cases of COVID-19. In: Travel medicine and infectious disease 37 (2020), p. 101742
Martcheva 2015: Martcheva, M.: An Introduction to Mathematical Epidemiology. Springer US, 2015 (Texts in Applied Mathematics). URL https://books.google.com/books?id=tt7HCgAAQBAJ. ISBN 9781489976123
Singh et al. 2020: Singh, Vijander; Poonia, Ramesh C.; Kumar, Sandeep; Dass, Pranav; Agarwal, Pankaj; Bhatnagar, Vaibhav; Raja, Linesh: Prediction of COVID-19 coronavirus pandemic based on time series data using Support Vector Machine. In: Journal of Discrete Mathematical Sciences and Cryptography 23 (2020), No. 8, pp. 1583–1597
Wacker and Schlüter 2020: Wacker, Benjamin; Schlüter, Jan: Time continuous and time-discrete SIR models revisited: theory and applications. In: Advances in Difference Equations 2020 (2020), No. 1, pp. 1–44. ISSN 1687-1847. URL https://doi.org/10.1186/s13662-020-02907-9