DELHI TECHNOLOGICAL UNIVERSITY
(Formerly Delhi College of Engineering)
Bawana Road, Delhi-110042

ECONOMETRICS (HU-351) PROJECT REPORT

Case Study: Modeling COVID-19 daily cases in Senegal using a generalized Waring regression model

SUBMITTED TO: Ms. Reenu Ahluwalia
SUBMITTED BY: Siddhant Vijay (2K20/CH/065)

CANDIDATE'S DECLARATION

I, Siddhant Vijay (Roll No. 2K20/CH/065), student of B.Tech. (Chemical Engineering), hereby declare that the project dissertation titled "Modeling COVID-19 daily cases in Senegal using a generalized Waring regression model", which is submitted by me to the Department of Chemical Engineering, Delhi Technological University, Delhi, in partial fulfilment of the requirements of the 5th semester of the Bachelor of Technology, is original and not copied from any source without proper citation. This work has not previously formed the basis for the award of any Degree, Diploma, Associateship, Fellowship or other similar title or recognition.

Place: Delhi
Date: 19-11-2021
Siddhant Vijay

CERTIFICATE

I hereby certify that the project "Modeling COVID-19 daily cases in Senegal using a generalized Waring regression model", which is submitted by Siddhant Vijay, Roll No. 2K20/CH/065, Chemical Engineering, Delhi Technological University, Delhi, in fulfilment of the requirements of the 5th semester of the Bachelor of Technology, is a record of the project work carried out by the student under my supervision. To the best of my knowledge, this work has not been submitted in part or full for any Degree or Diploma to this University or elsewhere.

Place: Delhi
Date: 13-11-2021
Ms. Reenu Ahluwalia

ACKNOWLEDGEMENT

I would like to convey my heartfelt thanks to Ms.
Reenu Ahluwalia for her ingenious ideas, tremendous help and cooperation. I am also extremely grateful to my friends, whose valuable suggestions, guidance and healthy criticism proved handy and useful for the completion of this project.

INDEX
✔ Introduction
  1. Regression
  2. Linear Regression
  3. Polynomial Regression
  4. Independent and dependent variables
  5. Assumptions
  6. Hypothesis
  7. Cost function
  8. Gradient Descent
  9. Mean Squared Error
  10. R2 Score
  11. Overfitting in Regression
✔ Problem Statement
✔ Approach
✔ Dataset
✔ Mechanics of Regression
✔ Performing Linear Regression
✔ Model Slope and Intercept
✔ Performing Polynomial Regression
✔ Graph for Polynomial Regression
✔ Predictions
✔ Regression metrics for model performance
✔ Checking Residual Analysis
✔ Graph for Residual Analysis
✔ Interpretation and conclusions
✔ References

INTRODUCTION

1. Regression:
Regression is a statistical method, used in finance, investing and other disciplines, that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). Regression helps investment and financial managers to value assets and to understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.

2. Linear Regression:
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one explanatory variable, the process is called multiple linear regression.

● Simple linear regression: Y = a + bX + u
● Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + … + btXt + u

Where:
● Y = the variable that you are trying to predict (dependent variable).
● X = the variable that you are using to predict Y (independent variable).
● a = the intercept.
● b = the slope.
● u = the regression residual.

3. Polynomial Regression:
o Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:

y = b0 + b1x + b2x^2 + b3x^3 + … + bnx^n

o It is also called a special case of Multiple Linear Regression in ML, because we add polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used for training in Polynomial Regression is of a non-linear nature.
o It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
o In Polynomial Regression, the original features are converted into polynomial features of the required degree (2, 3, …, n) and then modelled using a linear model.

Why Polynomial Regression?
o Consider a dataset whose points are arranged non-linearly. If we try to cover it with a linear model, we can clearly see that the line hardly covers any data point; a curve, on the other hand, covers most of the data points, and that is what the Polynomial model provides.
o Hence, if the data are arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.

4. Independent and dependent variables:
i. Independent variable
An independent variable is also called an input variable and is denoted by X. In practical applications, an independent variable is also called a feature variable or predictor variable. We can denote it as:
Independent or Input variable (X) = Feature variable = Predictor variable
ii. Dependent variable
A dependent variable is also called an output variable and is denoted by y. A dependent variable is also called a target variable or response variable.
It can be denoted as follows:
Dependent or Output variable (Y) = Target variable = Response variable

5. Assumptions:
There are 5 key assumptions in the Linear Regression model:
i. Linear relationship between the features and the target:
According to this assumption, there is a linear relationship between the features and the target. Linear regression captures only linear relationships.
ii. Little or no multicollinearity between the features:
Multicollinearity is a state of very high inter-correlation or inter-association among the independent variables. It is a type of disturbance in the data which, if present, weakens the statistical power of the regression model. The interpretation of a regression coefficient is that it represents the mean change in the target for each unit change in a feature when you hold all of the other features constant. However, when features are correlated, a change in one feature in turn shifts another feature; the stronger the correlation, the more difficult it is to change one feature without changing another. It becomes difficult for the model to estimate the relationship between each feature and the target independently, because the features tend to change in unison. If we have two features which are highly correlated, we can drop one of them, or combine the two to form a new feature which can then be used for prediction.
iii. Homoscedasticity assumption:
Homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables.
iv. Normal distribution of error terms:
The fourth assumption is that the errors (residuals) follow a normal distribution.
v. Little or no autocorrelation in the residuals:
Autocorrelation occurs when the residual errors are dependent on each other. The presence of correlation in the error terms drastically reduces the model's accuracy.
This usually occurs in time-series models, where the next instant depends on the previous instant.

6. Hypothesis:
For simple linear regression, the hypothesis is the straight line

hθ(x) = θ0 + θ1x,

where θ0 and θ1 are the parameters (intercept and slope) that the model learns.

7. Cost Function:

J(θ0, θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²,

where m is the number of instances present in the dataset.

8. Gradient Descent:
Gradient descent is an iterative algorithm which we run many times. On each iteration, we apply the following "update rule" (the := symbol means replace θ with the value computed on the right):

θj := θj − α (∂/∂θj) J(θ0, θ1).

Alpha (α) is a parameter called the learning rate. The learning rate gives the engineer some additional control over how large the steps are. Selecting the right learning rate is critical: if the learning rate is too large, you can overstep the minimum and even diverge.

In linear regression, the model aims to find the best-fit regression line to predict the value of y for a given input value. While training, the model evaluates the cost function, which measures the mean squared error between the predicted values and the true values, and it seeks to minimize this cost.

Update rules (obtained by differentiating the cost function J(θ0, θ1) above with respect to θ0 and θ1):

θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i))
θ1 := θ1 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) x(i)

This process can be represented in a plot of the cost function versus the parameters: imagine that we plot the cost function on the z-axis, with θ0 on the x-axis and θ1 on the y-axis. The process continues until we reach the very bottom of the surface, i.e., where the cost is minimum. The movement is made in the direction tangential to the curve, with the help of the derivative of the cost function.

Since we monitor the value of the cost function at each step of gradient descent, it is easy to see whether convergence has been achieved, given a suitable value of the learning rate, which controls how large the steps are. Here, an epoch represents one step taken in gradient descent.
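The update rules above can be sketched in code. The following is a minimal NumPy implementation of batch gradient descent for simple linear regression; the data, learning rate and epoch count are illustrative choices, not values taken from this report:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=5000):
    """Fit y ~ theta0 + theta1 * x by minimising the cost
    J(theta) = (1/2m) * sum((h(x) - y)^2) with batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        h = theta0 + theta1 * x           # current hypothesis h_theta(x)
        # partial derivatives of J with respect to theta0 and theta1
        grad0 = (h - y).sum() / m
        grad1 = ((h - y) * x).sum() / m
        # simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Noiseless toy line with intercept 2 and slope 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x
t0, t1 = gradient_descent(x, y)
print(round(t0, 2), round(t1, 2))
```

With α = 0.01 and 5000 epochs the parameters settle very close to the true intercept and slope, which is the convergence behaviour described above.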
The most commonly used learning rates are 0.001, 0.003, 0.01, 0.03, 0.1 and 0.3.
• If α is very small, gradient descent takes a long time to converge and becomes computationally expensive.
• If α is large, it may fail to converge and overshoot the minimum.
With a learning rate of 0.01 and 5000 epochs, gradient descent converges steadily to the minimum of the cost function.

9. MEAN SQUARED ERROR:
In regression analysis, plotting is a natural way to view the overall trend of the data. The mean of the squared distances from each point to the fitted regression model can be calculated and is known as the mean squared error (MSE). The squaring is critical: it prevents positive and negative errors from cancelling each other out. The smaller the MSE, the more accurate the model, meaning the model is closer to the actual data. One example of linear regression using this idea is the least-squares method, which evaluates the appropriateness of a linear regression model for a bivariate dataset, but whose limitations are related to the known distribution of the data.

10. R2 Score:
The R2 score is another metric used to evaluate the performance of a regression model. It is also called the coefficient of determination. It gives us an idea of the goodness of fit of a linear regression model: it indicates the percentage of the variance that is explained by the model. Mathematically,
R2 Score = Explained Variation / Total Variation.
In general, the higher the R2 score, the better the model fits the data. Its value usually ranges from 0 to 1, so we want it to be as close to 1 as possible. Its value can become negative if the model is badly wrong.

11. OVERFITTING IN REGRESSION
Overfitting occurs when the model is too complex for the data; it typically happens when the sample size is too small. If enough predictor variables are used in the regression model, we nearly always get a model that looks significant. While an overfitted model may fit the idiosyncrasies of the data extremely well, it won't fit additional test samples or the overall population.
The model's p-values, R-squared and regression coefficients can all be misleading.

How to Avoid Overfitting
In linear modelling (including multiple regression), you should have at least 10-15 observations for each term you are trying to estimate. Any fewer than that, and you run the risk of overfitting your model. "Terms" include:
● Interaction effects,
● Polynomial expressions (for modelling curved lines),
● Predictor variables.
While this rule of thumb is generally accepted, Green (1991) takes it a step further and suggests that the minimum sample size for any regression should be 50, with an additional 8 observations per term. For example, if you have one interaction term and three predictor variables (four terms in all), you'll need around 40-60 items in your sample to avoid overfitting by the first rule of thumb, or 50 + 4(8) = 82 items according to Green.

How to Detect and Avoid Overfitting
The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can't do that, the second option is to reduce the number of predictors in your model, either by combining them or by eliminating them. Factor analysis is one method you can use to identify related predictors that might be candidates for combining.

Cross-Validation
Use cross-validation to detect overfitting: it partitions your data, tests how well your model generalizes, and lets you choose the model which works best. One form of cross-validation is predicted R-squared. Most good statistical software will include this statistic, which is calculated by:
● Removing one observation at a time from your data,
● Estimating the regression equation for each iteration,
● Using the regression equation to predict the removed observation.
Cross-validation isn't a magic cure for small data sets, though, and sometimes a clear model isn't identified even with an adequate sample size.

Shrinkage & Resampling
Shrinkage and resampling techniques can help you find out how well your model might fit a new sample.
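The overfitting and cross-validation ideas above can be illustrated in code. The sketch below, using scikit-learn (which the references section cites) on simulated data, fits polynomial models of increasing degree and compares in-sample R2 with cross-validated R2; the data-generating curve, seed and degrees are illustrative assumptions, not values from this report:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, 40)   # true curve is quadratic

results = {}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    in_r2 = model.fit(X, y).score(X, y)              # in-sample R^2
    results[degree] = (in_r2, cv_r2)
    print(f"degree {degree:2d}: in-sample R2 = {in_r2:.3f}, CV R2 = {cv_r2:.3f}")
```

A straight line (degree 1) underfits the quadratic data; a degree-10 model fits the sample at least as well as the degree-2 model in-sample, yet its cross-validated R2 is typically lower. That gap between in-sample and cross-validated performance is exactly the overfitting signal described above.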
Automated Methods
Automated stepwise regression shouldn't be used as an overfitting solution for small data sets.

PROBLEM STATEMENT
COVID-19 is an acute and severe viral respiratory infection, with high fever, fatigue, cough and shortness of breath as the most common symptoms. The ongoing pandemic has affected more than 184 million people in the world, causing severe public-health and economic/financial crises globally, and countries such as Senegal were constrained to the most restrictive preventive measures, such as lockdowns and the closure or limited opening of shops and schools. As of July 9, 2021, the World Health Organization's (WHO) statistics reported that the death rate of detected infectious individuals with COVID-19 was on average 2.16 percent (0.1%-9.3%), depending on individuals' age group and other factors such as health status and immune system, among many others. However, with the persistent issue of under-reporting and of asymptomatic or untested cases, one would expect the actual case fatality rate to be (perhaps much) lower. In Senegal, the COVID-19 third wave tends to have a higher peak than the first and second waves, with the number of deaths following the same pattern. A powerful tool used to mitigate the spread and potential impact of the pandemic is mathematical modeling, which can forecast the long-term dynamical spread of the disease and help support, using data, the effective allocation and mobilization of scarce public-health resources.

AIM OF THE REGRESSION ANALYSIS
We aim to investigate the spread of COVID-19 by fitting a generalized Waring regression model (GWRM) to the number of new daily cases in Senegal, and to make short-term predictions. The application of this model to COVID-19 is, to the best of our knowledge, relatively new. Next, we compare the performance of this model with that of the Poisson regression model (PRM) and the negative binomial regression model (NBRM).
An extension of the negative binomial distribution that allows three sources of variation to be distinguished is the univariate generalized Waring distribution (UGWD); the GWRM is a regression model with a UGWD as its underlying distribution. One main advantage of the GWRM over traditional models such as the NBRM is that the GWRM distinguishes individual heterogeneity due to intrinsic (internal) factors from that due to external factors (e.g., covariates that influence the variability of the data but, because they cannot be observed or measured, are not included in the model).

DATA
Our dataset consists of the historical daily new confirmed COVID-19 cases in Senegal. Cases during the first wave of the outbreak were collected from the Senegalese Ministry of Health website (https://www.sante.gouv.sn/), available at this GitHub repository (https://github.com/senegalouvert/COVID-19). The observations run from 2 March 2020 to 22 October 2020. The daily reported number of new cases and the cumulative number of reported cases are depicted in Fig. 2, while the ratio of the number of reported cases to the number of screening tests is depicted in Fig. 3. To test the predictive power of each of the models, we simulate the possibility of near real-time predictions during an actual outbreak by using 90% of the data for estimation and the remaining 10% for evaluation.

The other variables of the dataset are the following:
• the number of daily screening tests;
• the number of daily contact COVID-19 cases. A contact case is a COVID-19 case in which the person has been infected by another person with whom he or she was in contact, whether a family member, a friend, a neighbor, etc. That is to say, the person knows exactly who contaminated him/her and where he/she was infected;
• the daily number of community COVID-19 cases.
A community case is a COVID-19 case in which the person does not know where he or she was infected.
• the number of daily COVID-19 imported cases (from foreign countries or from outside the community);
• the number of total recovered cases;
• the number of total deaths due to COVID-19;
• the number of total cases evacuated to foreign countries;
• the number of daily critical cases.
The descriptive statistics of all variables used in this paper are summarized in Table 1.

L. Gning, C. Ndour and J.M. Tchuenche, Physica A 597 (2022) 127245

3.1. Poisson regression model (PRM)
In this model, we assume that Y_t | F_{t−τ} ∼ Poisson(λ_{x_{t−τ}}),   (1)

where λ_{x_{t−τ}} = exp(x′_{t−τ} β). It follows that:

Var(Y_t | F_{t−τ}) = E(Y_t | F_{t−τ}) = λ_{x_{t−τ}} = exp(x′_{t−τ} β).   (2)

The conditional mean is then equal to the conditional variance (equidispersion) for this model, which constitutes a major shortcoming for its application to count data.

3.2. Negative binomial regression model (NBRM)
In this model, we make the following distributional assumption for Y_t given F_{t−τ}:

Y_t | F_{t−τ}, λ_{x_{t−τ}} ∼ Poisson(λ_{x_{t−τ}}),   (3)

where λ_{x_{t−τ}} is a gamma-distributed variable:

λ_{x_{t−τ}} ∼ Gamma(θ, v_{x_{t−τ}}).   (4)

Then, using straightforward calculations, we obtain (see, for example, [14,19] or [20]):

Y_t | F_{t−τ} ∼ NegativeBinomial(θ, p_{x_{t−τ}}),   (5)

where p_{x_{t−τ}} = 1/(1 + v_{x_{t−τ}}). It follows, for the NBRM, that the conditional mean is given by:

µ_{x_{t−τ}} = E(Y_t | F_{t−τ}) = E(λ_{x_{t−τ}}) = θ v_{x_{t−τ}}.   (6)

Remark that this model is a Poisson–gamma mixture model and that, for the gamma distribution, the parameter θ does not depend on the covariates, contrary to v_{x_{t−τ}}. The model parametrized this way is referred to as the NegBin II model in the literature (see [19]).
The variance of the model is now given by:

Var(Y_t | F_{t−τ}) = E(λ_{x_{t−τ}}) + Var(λ_{x_{t−τ}}) = µ_{x_{t−τ}} + (1/θ) µ²_{x_{t−τ}}.   (7)

The decomposition of this variance is interpreted as follows: the first term of the model variance represents the variability due to the randomness inherent in any random phenomenon, and the second the differences between the days (between 2 March 2020 and 22 October 2020 in our practical application). For more details on the interpretation of the variance decomposition, see [14] or [20].

3.3. Generalized Waring regression model (GWRM)
The GWRM is an extension of the NBRM. Contrary to the latter model, it allows one to distinguish the variability due to individual differences related to the covariates included in the model from the variability due only to individual differences not related to the model covariates [14,20]. Next, we present a brief background on the genesis of the GWRM and some properties of the model. In this model, we assume that Y_t given F_{t−τ} has a univariate generalized Waring distribution:

Y_t | F_{t−τ} ∼ UGWD(θ_{x_{t−τ}}, k, ρ).   (8)

This implies that the probability mass function of Y_t | F_{t−τ} is given by [16]:

f(y|x) = [Γ(θ_x + ρ) Γ(k + ρ) / (Γ(ρ) Γ(θ_x + k + ρ))] · [(θ_x)_y (k)_y / (θ_x + k + ρ)_y] · (1/y!),   y = 0, 1, 2, …,   (9)

where θ_x, k, ρ > 0, (α)_r = Γ(α + r)/Γ(α) for α > 0, and Γ is the gamma function defined by Γ(x) = ∫₀^∞ u^{x−1} exp(−u) du for x > 0. Then, the conditional mean is given by:

E(Y_t | F_{t−τ}) = µ_{x_{t−τ}} = θ_{x_{t−τ}} k / (ρ − 1),   (ρ > 1).   (10)

In addition, linking the conditional mean with the explanatory variables as µ_{x_{t−τ}} = exp(x′_{t−τ} β), we have:

θ_{x_{t−τ}} = µ_{x_{t−τ}} (ρ − 1) / k.   (11)

Furthermore, the variance of the GWRM is given by [14,20]:

Var(Y_t | F_{t−τ}) = µ_{x_{t−τ}} + [(k + 1)/(ρ − 2)] µ_{x_{t−τ}} + [(k + ρ − 1)/(k(ρ − 2))] µ²_{x_{t−τ}},   (ρ > 2).   (12)

The decomposition of this variance is interpreted as follows:
• the first term, referred to as randomness, represents the variability inherent in any random phenomenon;
• the second, called liability, represents the variability due to individual differences related to the explanatory variables;
• the third, called proneness, is the variability due to individual differences not related to the explanatory variables included in the model.
Finally, note that the GWRM is an NBRM-beta mixture model, the NBRM being a limiting case of the GWRM [14]. For further details on this regression model, the reader may refer to [14] or [20].

3.4. Maximum likelihood estimation
Each of the three models considered above, namely the PRM, the NBRM and the GWRM, was fitted using maximum likelihood estimation, that is, by maximizing, respectively:

ℓ1(β) = Σ_{t=τ+1}^{T} [ y_t x′_{t−τ} β − exp(x′_{t−τ} β) − log(y_t!) ],   (13)

ℓ2(β, θ) = Σ_{t=τ+1}^{T} [ y_t log( exp(x′_{t−τ} β) / (θ + exp(x′_{t−τ} β)) ) − θ log( 1 + exp(x′_{t−τ} β)/θ ) + log( Γ(θ + y_t) / (Γ(θ) y_t!) ) ],   (14)

ℓ3(β, ρ, k) = Σ_{t=τ+1}^{T} log[ Γ(θ_{x_{t−τ}} + ρ) Γ(ρ + k) Γ(θ_{x_{t−τ}} + y_t) Γ(k + y_t) / ( Γ(ρ) Γ(θ_{x_{t−τ}}) Γ(k) Γ(θ_{x_{t−τ}} + k + ρ + y_t) y_t! ) ],   where θ_{x_{t−τ}} = (ρ − 1) exp(x′_{t−τ} β) / k.   (15)

The numerical calculation of the maximum likelihood estimators of the model parameters is carried out with the open-source software R [21]. Note that, because the PRM and the NBRM are special cases of the generalized linear model, we estimate their parameters using the iteratively re-weighted least squares (IRLS) algorithm (for more details, see [22]). The PRM case is implemented in the glm function included in the R package stats, while the NBRM case is implemented in glm.nb, included in the package MASS. Finally, we use the R package GWRM, developed by [20], to fit the GWRM. The algorithm used in this package is based on non-linear minimization, Nelder–Mead optimization and L-BFGS-B methods (see [20]).
For adaptation and reproducibility, the source code of all these packages is available on the Comprehensive R Archive Network (CRAN) repository (http://cran.r-project.org/web/packages/GWRM).

4. Application
The number of new daily COVID-19 cases is considered as a discrete count variable with an over-dispersion tendency. In comparison with the Poisson model, several causes might be responsible for this variability. First, there are external observable factors that could significantly influence the number of daily COVID-19 cases, such as:
– x1: the number of daily contact COVID-19 cases;
– x2: the daily number of community COVID-19 cases;
– x3: the number of daily COVID-19 imported cases.
Second, there are missing external covariates which could affect the response variable, and the GWRM can provide more information on the data through the quantification of the effect and sources of this variability. Finally, the number of daily screening COVID-19 tests has been included as an offset in the models. This means that the ratio (number of new daily COVID-19 cases)/(number of daily screening COVID-19 tests) is used as the response variable for all models.

Choice of τ. The parameter τ denotes the delay between exposure and symptom onset; the COVID-19 incubation period can last up to 14 days, with a median time of about 5 days [23]. From a machine-learning point of view, τ can be considered a hyper-parameter, so, in the following, we use a grid search to automate the tuning of τ. In Fig. 4, we plot the root mean square error (RMSE) of the models as a function of τ, and it can be observed that τ = 3 is the optimal value.

4.1. Results and discussion
We fit the models from Section 3 to the Senegal data. The computed values of the log-likelihood function, the Akaike information criterion and the Bayesian information criterion, shown in Table 2, enable us to evaluate the accuracy of the model fits.
In addition, for each model, using a cross-validation procedure, we report in Table 2 the values of the RMSE, the mean absolute error (MAE) and the mean error (ME). Our results show that the best model of the three is the generalized Waring regression model (GWRM). In Table 3, we display the maximum likelihood parameter estimates for the GWRM (the best-fitting model), their respective standard errors, and the associated partial Wald tests (statistics and p-values). The estimates of k and ρ are respectively 53.68050 and 12.85094, with standard errors of 46.012067 and 4.668843. Note that the GWRM includes only the daily number of COVID-19 contact cases as a significant covariate at the 5% significance level.

Now, we focus on how well the GWRM fits the observed data. We investigate this issue by carrying out a visual residual analysis. In Figs. 5, 6 and 7, we depict the quantile-quantile plots (QQ-plots) with simulated envelopes for residuals of type deviance, of type Pearson and of type response (for more details, see [24,25] and [20]). For the deviance and Pearson residuals, all the plotted points fall within the boundaries of the envelope, which indicates a good accuracy of the fitted model. For the response residuals, there are some points outside the envelope, a sign of a potential lack of accuracy, but not strong enough to constitute evidence against the adequacy of the fitted model. Thus, the GWRM is appropriate for fitting the data. We observe that, on average, the proneness represents 50.18 percent of the variability of the model, while the liability represents 41.57 percent and the randomness 8.25 percent.
This means that about 50% of the variation of the model is due to external factors that are unknown because they are difficult to observe or to measure. Fig. 8 depicts the relationship between the proportion of each component of the variance and the daily new COVID-19 cases. It can be seen that, on average, the randomness and the liability decrease as the number of new daily COVID-19 cases increases, while the proneness increases on average.

Fig. 9 shows the proportion of the variance related to randomness, liability and proneness in terms of the number of screening tests. In general, the variability in the number of new daily COVID-19 cases due to randomness and liability decreases as the number of screening tests increases, whereas the variability due to proneness increases. So, we can deduce that increasing the number of screening tests emphasizes the role of the daily characteristics as a cause of differences between days in relation to the number of new daily COVID-19 cases, whereas randomness and liability become somewhat less relevant.

Figs. 10-12 illustrate the relationships between the model's covariates and the variance related to randomness, liability and proneness. In general, the variability in the number of daily new COVID-19 infections due to randomness and liability decreases as the number of daily contact cases or the number of daily community cases increases, whereas the variability due to proneness increases. Thus, an increase in the daily number of cases emphasizes the role of the proneness, whereas randomness and liability become less relevant. In contrast, the variability in the number of new daily infections varies randomly as the number of daily imported cases increases (Fig. 12). However, the fraction of proneness remains globally more important than the other sources of variation.
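The percentages discussed above follow directly from the GWRM variance decomposition. As a small sketch, the share of each component can be computed from Eq. (12) using the reported estimates k ≈ 53.68 and ρ ≈ 12.85; the mean level mu = 50 below is an arbitrary illustrative value, not a quantity taken from the data:

```python
def gwrm_variance_shares(mu, k, rho):
    """Split Var(Y) = mu + mu*(k+1)/(rho-2) + mu**2*(k+rho-1)/(k*(rho-2))
    (Eq. 12, requires rho > 2) into randomness, liability and proneness
    proportions, following the interpretation given in Section 3.3."""
    randomness = mu                                    # inherent randomness
    liability = mu * (k + 1) / (rho - 2)               # tied to the covariates
    proneness = mu ** 2 * (k + rho - 1) / (k * (rho - 2))  # unrelated to covariates
    total = randomness + liability + proneness
    return randomness / total, liability / total, proneness / total

r, l, p = gwrm_variance_shares(mu=50.0, k=53.68, rho=12.85)
print(f"randomness {r:.1%}, liability {l:.1%}, proneness {p:.1%}")
```

With these (partly hypothetical) values, proneness comes out as the largest share, consistent with the average split reported above.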
It is important to note that the imported cases represent 2.24% of the total COVID-19 cases in Senegal during the first wave of the outbreak, with a peak of 27 on June 13, 2020, and that the GWRM does not include this variable as a significant covariate at the 5% significance level.

5. Conclusion
The rapid spread of the COVID-19 pandemic has triggered substantial economic and social disruptions worldwide. Several mathematical studies have been proposed to investigate and forecast the long-term dynamics of the disease. We use count regression models, such as the generalized Waring regression model, to forecast the daily confirmed new COVID-19 cases in Senegal. We fitted the generalized Waring regression model (GWRM) to the Senegal data (daily number of confirmed new COVID-19 infections). Results from this study reveal that the generalized Waring regression model fits the data better than the usual count-data models, such as the negative binomial regression model (NBRM); the former can therefore explain specific characteristics of the studied phenomenon. We also analyze the variance decomposition for each day. An advantage of the GWRM is that it accounts for factors that are known to affect the number of daily COVID-19 cases.

This study is not exhaustive, and future work can explore several other avenues, such as modeling the daily mortality due to COVID-19 using a GWRM, investigating (1) the impact of testing capacity on the number of detected daily cases [26], and (2) the effect of various prevention and therapeutic intervention measures on COVID-19 dynamics. One can also try to model the daily number of COVID-19 cases using a generalized Waring auto-regressive process in order to better understand the COVID-19 dynamics. A possible issue with daily confirmed-case data is the potential for weekly seasonality effects: fewer tests are performed over weekends, hence fewer cases and oscillations in the data.
In our future study, we plan to investigate:
– whether there are any such issues in the Senegal data, and how to correct them for further improvement;
– whether the generalized Waring regression model can capture the disease's multiple waves.

REFERENCES
● L. Gning, C. Ndour and J.M. Tchuenche, Physica A 597 (2022) 127245, www.elsevier.com/locate/physa
● https://stackabuse.com/linear-regression-in-python-with-scikit-learn
● https://www.statisticssolutions.com/assumptions-of-linear-regression
● https://en.wikipedia.org/wiki/Coefficient_of_determination
● https://en.wikipedia.org/wiki/Polynomial_regression
● https://www.javatpoint.com/machine-learning-polynomial-regression