HARAMAYA UNIVERSITY
COLLEGE OF HEALTH SCIENCE AND MEDICINE
SCHOOL OF POSTGRADUATE STUDIES

PROJECT WORK ON
Relationship Between Number of Cigarettes Smoked Per Day and Determinants of Lung Cancer

A RESEARCH PROJECT SUBMITTED TO: MR. ADISU (MPH, BIOSTATISTICS)

PREPARED BY:
1. MOHAMMED ADUS
2. REGASSA DADI
3. DAWIT WENGELU
4. TEMESGEN KANSI
5. NEGASH ASEFA
6. DESALEGN ADUGNA

SEP/2022
HARAR, ETHIOPIA

Contents
INTRODUCTION
Methods and material
Present the descriptive statistics (graph, table, charts, measures of central tendency and dispersion) for important variable
Measure of central tendency and dispersion
Identify the determinants of "number of cigars smoked each day" (i.e. the uncategorized data) using relevant risk factors (use the appropriate statistical method, please!). Interpret all the relevant statistical outputs you get including its model assumptions
Interpretation for linear regression
Model Assumptions
Interpretation for linear regression
Use appropriate model for identifying the factors associated with lung cancer
INTERPRETATION
Determine the 95% confidence interval for coefficients of variables, check model adequacy and Interpret all the relevant outputs from each model you fitted above (Question 5)
Results
Interpretation
Discussion
CONCLUSION
REFERENCES

Introduction

The main research question for this paper is: "Is there a relationship between the number of cigarettes smoked per day and the determinants of lung cancer (income, household size, sex, marital status, sticks smoked and others)?" This research question is significant because the results would identify the main reasons that cause people to smoke different numbers of cigarettes per day. The dependent variable in the study is the number of cigarettes smoked. In the data provided, the number of cigarettes is named cigs and ranges from a low of zero to a high of 80.
Cigarette smoking is an unhealthy behavior that is widespread all over the world, and it is the leading cause of premature death. Globally, around twenty percent of adults smoke cigarettes, and smoking caused roughly a hundred million deaths during the 20th century. Socioeconomic status (SES) has been considered the most important determinant of smoking behavior. Based on the theory of diffusion of innovations, four phases of smoking have been described (Qing Wang, 2018). In the first phase, innovators, or the higher socioeconomic groups, take up smoking. In the second phase, smoking spreads to the entire population, including the lower socioeconomic groups. The third phase is characterized by the start of cessation among higher socioeconomic groups, male predominance, and an increase in female smoking. Lastly, in the fourth phase, smoking declines among the higher socioeconomic groups but remains high among those of lower socioeconomic status. The effects of SES on smoking behavior may therefore vary across countries with different levels of socioeconomic development.

Cigarette smoking severely affects the health of individuals with low socioeconomic status. Lower-income smokers are affected more by smoking-related illnesses than smokers with higher earnings, and they have a higher risk of lung cancer than those from richer groups. Also, those with less than a high school education have a higher incidence of lung cancer than those with a college education (CDC, 2021). In addition, populations with lower income have less contact with health care services, so many individuals are diagnosed with smoking-related diseases and conditions at a later stage.
Methodology

This paper explores the effects of months of smoking, age, time of smoking, family size, income and other variables on cigs, the number of cigarettes an individual smokes in a day. Statistical analysis was done using Stata/SE version 15. The study was conducted on a study population of 1250. The following methods were used:

• Two independent-sample t-test for the continuous outcome variable (average number of cigarette sticks consumed per day) against categorical variables with two groups (sex, status and marital status).
• One-way ANOVA for more than two independent groups (educational status).
• Linear regression for the continuous outcome variable (average number of cigarette sticks consumed per day) and independent variables such as months of smoking, age, time of smoking, family size and others.
• Logistic regression for the binary outcome (lung cancer) and independent variables such as sticks consumed, time of smoking, age, status, income and others.

The linear regression assumptions were assessed as follows:
1. Linearity, by two-way scatter plot.
2. Normality, by kernel density.
3. Homoscedasticity, by imtest and hettest.
4. Multi-collinearity, by vif.

We checked our model selection by backward elimination and stepwise selection.

The regression model used is as follows:

$$\text{cigs} = \beta_0 + \beta_1\,\text{restaurn} + \beta_2\,\text{cigprice} + \beta_3\,\text{income} + \beta_4\,\text{white} + \beta_5\,\text{age} + \beta_6\,\text{age}^3 + \beta_7\,\text{age}^4 + u$$

The independent variables are as follows. Restaurn captures state smoking restrictions: if the state does not have smoking restrictions, the number of cigarettes smoked per day is expected to be higher. Cigprice, the state cigarette price, is expected to affect cigs negatively, because the lower the price, the more cigarettes people smoke daily. Age is expected to affect cigs positively: the higher the age, the higher the income and the more cigarettes smoked. Income is expected to affect cigs positively: the higher people's income, the more cigarettes they smoke.
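The two independent-sample t-test named in the methodology can be reproduced by hand from group summaries. The following is a minimal Python sketch, not the Stata implementation: the function names are hypothetical, the inputs are the male and female summaries reported for sticks_consu by sex, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only because the degrees of freedom are large.

```python
import math

def pooled_t_test(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic assuming equal (pooled) variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se_diff
    return t, df, se_diff

def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (assumes large df)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Group summaries of sticks_consu by sex: male first, then female.
t, df, se = pooled_t_test(604, 11.83113, 10.15296, 646, 11.46904, 10.48963)
p = two_sided_p_normal(t)
print(f"t = {t:.4f}, df = {df}, SE(diff) = {se:.4f}, approx. p = {p:.3f}")
```

The same function can be applied to the marital status and status comparisons by substituting their group sizes, means and standard deviations.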
Description of the data

The data consisted of 11 variables: month_smoking, age_years, marital_status, sticks_consu, educ_status, time_smoking, sex, hh_size, status, income_month and lung_cancer. educ_status represents the level of education, age_years the age of the person in years, income_month the monthly income of the individual in birr, and sticks_consu (cigs) the number of cigarettes smoked per day.

Descriptive Statistics

The median length of time, in months, until the resumption of cigarette smoking was 33. Sixteen observations were below the 25th percentile and 70 observations were above the 95th percentile in the table above.

Figure 4. Lung cancer over household size.

Figure 6. Box plot of the average number of cigarette sticks consumed each day during the first phase.

We used the two independent-sample t-test to compare mean sticks consumed by sex, status (censored vs resumed) and marital status.

. ttest sticks_consu, by(sex)

Two-sample t test with equal variances
   Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
    Male |   604   11.83113    .413118    10.15296     11.0198    12.64245
  Female |   646   11.46904    .412709    10.48963    10.65862    12.27946
combined | 1,250     11.644   .2920572    10.32578    11.07102    12.21698
    diff |         .3620856   .5845887               -.7847994    1.508971

diff = mean(Male) - mean(Female)                        t = 0.6194
Ho: diff = 0                           degrees of freedom = 1248
Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
Pr(T < t) = 0.7321       Pr(|T| > |t|) = 0.5358     Pr(T > t) = 0.2679

Assumptions of the two independent-sample t-test:
• The variances of the dependent variable in the two populations are equal.
• The dependent variable is normally distributed within each population.
• The data are independent (scores of one participant are not related systematically to the scores of the others).

Hypothesis: Ho: μm = μf vs HA: μm ≠ μf

We conclude that there is no significant difference between males and females in the mean number of cigarettes smoked each day. The p-value (0.5358) is greater than 0.05, so we fail to reject the null hypothesis.

. ttest sticks_consu, by(marital_status)

Two-sample t test with equal variances
   Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
Never ma |   639   11.84664   .4280062    10.81933    11.00616    12.68711
 Married |   611   11.43208   .3959561    9.787407    10.65448    12.20968
combined | 1,250     11.644   .2920572    10.32578    11.07102    12.21698
    diff |         .4145568   .5843772               -.7319134    1.561027

diff = mean(Never ma) - mean(Married)                   t = 0.7094
Ho: diff = 0                           degrees of freedom = 1248
Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
Pr(T < t) = 0.7609       Pr(|T| > |t|) = 0.4782     Pr(T > t) = 0.2391

Hypothesis: Ho: μnm = μm vs HA: μnm ≠ μm

We conclude that there is no significant difference between married and never-married respondents in the mean number of cigarettes smoked each day. The p-value (0.4782) is greater than 0.05, so we fail to reject the null hypothesis.

. ttest sticks_consu, by(status)

Two-sample t test with equal variances
   Group |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
censored |   896   13.31808   .3635118    10.88109    12.60465    14.03152
 Resumed |   354    7.40678   .3830718    7.207453     6.65339     8.16017
combined | 1,250     11.644   .2920572    10.32578    11.07102    12.21698
    diff |         5.911301    .626519                4.682154    7.140447

diff = mean(censored) - mean(Resumed)                   t = 9.4352
Ho: diff = 0                           degrees of freedom = 1248
Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
Pr(T < t) = 1.0000       Pr(|T| > |t|) = 0.0000     Pr(T > t) = 0.0000

Hypothesis: Ho: μc = μr vs HA: μc ≠ μr

We conclude that there is a significant difference between censored and resumed respondents in the mean number of cigarettes smoked each day. The p-value is less than 0.05, so we reject the null hypothesis.

One-way ANOVA for more than two populations

. oneway sticks_consu educ_status

                     Analysis of Variance
        Source |        SS       df        MS         F    Prob > F
Between groups |  2440.46591      4   610.116477    5.81     0.0001
 Within groups |  130730.114   1245   105.004108
         Total |   133170.58   1249   106.621761

Bartlett's test for equal variances: chi2(4) = 15.5655   Prob>chi2 = 0.004

Assumptions for one-way ANOVA:
• The outcome is normally distributed.
• Population variance is assumed constant among the groups.
• Independent random samples among the groups.

Ho: µ1 = µ2 = ... = µk, HA: at least one of the means is different.

We reject the null hypothesis (p-value < 0.05) and conclude that at least one group mean of cigarettes smoked each day differs from the others. Bartlett's test (p = 0.004) indicates that the group variances are not equal, so the constant-variance assumption is questionable.

Comparison of the average number of cigarette sticks consumed each day during the first phase by level of education (Bonferroni)

Row Mean -
Col Mean |   Did not    High sch    Some col     College
High sch |  -2.09707
         |     0.124
Some col |  -3.23212    -1.13504
         |     0.003       1.000
 College |  -3.94306    -1.84599    -.710946
         |     0.000       0.229       1.000
Post-und |  -3.16846    -1.07139     .063655     .774601
         |     0.149       1.000       1.000       1.000

Now the question is: which groups are different? Answering this question requires multiple comparisons. The Bonferroni method corrects the probability of Type I error for the number of tests. Two pairwise comparisons are statistically significant at the 0.05 level: some college vs did not complete high school (p = 0.003) and college vs did not complete high school (p < 0.001).

Identify the determinants of "number of cigars smoked each day" (i.e. the uncategorized data) using relevant risk factors (use the appropriate statistical method, please!). Interpret all the relevant statistical outputs you get including its model assumptions.

We used a linear regression model for our continuous outcome variable (average number of cigarettes consumed each day). The independent variables were chosen by variable selection based on significance in the multivariable model:

. stepwise, pr(.01): regress sticks_consu month_smoking age_years marital_status educ_status time_smoking sex ... income_month
                      begin with full model
p = 0.8380 >= 0.0100  removing income_month
p = 0.7364 >= 0.0100  removing marital_status
p = 0.5188 >= 0.0100  removing sex
p = 0.2628 >= 0.0100  removing educ_status
p = 0.1733 >= 0.0100  removing status

  Source |         SS       df         MS       Number of obs =  1,250
   Model |  67575.7398       4   16893.9349     F(4, 1245)    = 320.65
Residual |  65594.8402   1,245   52.6866187     Prob > F      = 0.0000
   Total |   133170.58   1,249   106.621761     R-squared     = 0.5074
                                                Adj R-squared = 0.5059
                                                Root MSE      = 7.2586

 sticks_consu |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
month_smoking |   .1399341   .0114573    12.21   0.000    .1174563    .1624118
    age_years |   .4584523    .022728    20.17   0.000     .413863    .5030417
 time_smoking |  -.0829848   .0282136    -2.94   0.003   -.1383363   -.0276334
      hh_size |  -.4173792   .1481988    -2.82   0.005   -.7081262   -.1266321
        _cons |  -10.54407   .9355041   -11.27   0.000   -12.37941   -8.708731

Interpretation for linear regression

Dependent variable: number of cigars smoked each day. The coefficients are in cigarettes per day, not percentages.
• As months of smoking increase by one unit, the average number of cigarettes smoked each day increases by 0.14, keeping other variables constant.
• As age increases by one year, the average number of cigarettes smoked each day increases by 0.46, keeping other variables constant.
• As time of smoking increases by one unit, the average number of cigarettes smoked each day decreases by 0.08, keeping other variables constant.
• As family size increases by one person, the average number of cigarettes smoked each day decreases by 0.42, keeping other variables constant.

Model Assumptions

Linearity: the relationship between the independent and dependent variables is linear, so the linearity assumption is met.

Fig 7. Relationship between residuals and the average number of cigarette sticks consumed each day during the first phase.

Normality: the error terms are normally distributed, so the normality assumption is met.
Fig 8. Normality distribution of error terms: kernel density estimate of the Pearson residuals against the normal density (kernel = epanechnikov, bandwidth = 0.2196).

Fig 9. Distribution of standardized residuals and inverse normal.

Homoscedasticity: the variance of the error terms should be constant, that is, the residuals should have homogeneous variance. The homoscedasticity assumption is not met. The variance of the residuals is non-constant, so the model is heteroscedastic.

Fig 10. Homoscedasticity of residuals and fitted values.

Multi-collinearity: when there is a perfect linear relationship among the predictors, the estimates cannot be uniquely computed. We can use the vif command after the regression to check for multi-collinearity. As a rule of thumb, a variable whose VIF is greater than 10 may need further investigation. In this case every VIF is less than 10, so there is no multicollinearity.

. vif

    Variable |   VIF      1/VIF
time_smoking |  1.90   0.525649
   age_years |  1.86   0.537630
month_smok~g |  1.43   0.701662
    Mean VIF |  1.73

Interpret all the relevant outputs from each model you fitted above (Question 3)

Interpretation for linear regression

Dependent variable: number of cigars smoked each day. The coefficients are in cigarettes per day, not percentages.
• As months of smoking increase by one unit, the average number of cigarettes smoked each day increases by 0.14, keeping other variables constant.
• As age increases by one year, the average number of cigarettes smoked each day increases by 0.46, keeping other variables constant.
• As time of smoking increases by one unit, the average number of cigarettes smoked each day decreases by 0.08, keeping other variables constant.
• As family size increases by one person, the average number of cigarettes smoked each day decreases by 0.42, keeping other variables constant.
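The interpretation above can be illustrated by plugging values into the fitted equation. The following Python sketch uses the coefficients from the stepwise regression output; the function name and the respondent profile are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the stepwise linear regression output.
COEF = {
    "_cons": -10.54407,
    "month_smoking": 0.1399341,
    "age_years": 0.4584523,
    "time_smoking": -0.0829848,
    "hh_size": -0.4173792,
}

def predict_sticks(month_smoking, age_years, time_smoking, hh_size):
    """Fitted mean of sticks_consu (cigarettes per day) for one covariate profile."""
    return (
        COEF["_cons"]
        + COEF["month_smoking"] * month_smoking
        + COEF["age_years"] * age_years
        + COEF["time_smoking"] * time_smoking
        + COEF["hh_size"] * hh_size
    )

# Hypothetical respondent, then the same respondent one year older.
base = predict_sticks(month_smoking=30, age_years=40, time_smoking=10, hh_size=4)
older = predict_sticks(month_smoking=30, age_years=41, time_smoking=10, hh_size=4)

print(f"predicted sticks per day: {base:.2f}")
print(f"change for one extra year of age: {older - base:.4f}")
```

Holding the other covariates fixed, the difference between the two predictions equals the age_years coefficient, which is why that coefficient is read as a change in cigarettes per day rather than as a percentage.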
Use appropriate model for identifying the factors associated with lung cancer

We first examined whether there is an association between cigarette smoking (exposure) and lung cancer (outcome), ignoring the other potential confounders.

Stepwise logistic regression using the likelihood ratio test

. logit lung_cancer sticks_consu, or

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood = -730.84004
Iteration 2:   log likelihood = -730.79539
Iteration 3:   log likelihood = -730.79539

Logistic regression                    Number of obs =  1,250
                                       LR chi2(1)    =  17.00
                                       Prob > chi2   = 0.0000
Log likelihood = -730.79539            Pseudo R2     = 0.0115

 lung_cancer | Odds Ratio   Std. Err.       z    P>|z|   [95% Conf. Interval]
sticks_consu |   1.024798   .0060488     4.15   0.000    1.013011    1.036723
       _cons |   .2862447   .0281062   -12.74   0.000     .236134    .3469895

. est store a

. logit lung_cancer time_smoking sticks_consu

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood = -702.50812
Iteration 2:   log likelihood = -702.02782
Iteration 3:   log likelihood = -702.02779

Logistic regression                    Number of obs =  1,250
                                       LR chi2(2)    =  74.53
                                       Prob > chi2   = 0.0000
Log likelihood = -702.02779            Pseudo R2     = 0.0504

 lung_cancer |      Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
time_smoking |   .0519246   .0069667     7.45   0.000    .0382701    .0655791
sticks_consu |   .0020913    .006876     0.30   0.761   -.0113855    .0155681
       _cons |  -1.595195   .1125756   -14.17   0.000   -1.815839   -1.374551

. est store b

. lrtest b a

Likelihood-ratio test                  LR chi2(1)  =  57.54
(Assumption: a nested in b)            Prob > chi2 = 0.0000

. logit lung_cancer time_smoking sticks_consu age_years

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood = -700.20785
Iteration 2:   log likelihood = -699.66674
Iteration 3:   log likelihood = -699.66669

Logistic regression                    Number of obs =  1,250
                                       LR chi2(3)    =  79.26
                                       Prob > chi2   = 0.0000
Log likelihood = -699.66669            Pseudo R2     = 0.0536

 lung_cancer |      Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
time_smoking |   .0419965   .0082204     5.11   0.000    .0258849    .0581082
sticks_consu |  -.0071338   .0079904    -0.89   0.372   -.0227948    .0085271
   age_years |    .017522   .0079996     2.19   0.028     .001843    .0332009
       _cons |  -2.117346   .2667801    -7.94   0.000   -2.640225   -1.594467

. est store c

. lrtest c b

Likelihood-ratio test                  LR chi2(1)  =   4.72
(Assumption: b nested in c)            Prob > chi2 = 0.0298

. logit lung_cancer time_smoking sticks_consu age_years status

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood = -698.61322
Iteration 2:   log likelihood = -697.99343
Iteration 3:   log likelihood = -697.99328
Iteration 4:   log likelihood = -697.99328

Logistic regression                    Number of obs =  1,250
                                       LR chi2(4)    =  82.60
                                       Prob > chi2   = 0.0000
Log likelihood = -697.99328            Pseudo R2     = 0.0559

 lung_cancer |      Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
time_smoking |   .0398203   .0082956     4.80   0.000    .0235612    .0560795
sticks_consu |  -.0090132   .0080604    -1.12   0.263   -.0248113    .0067848
   age_years |   .0172915   .0080068     2.16   0.031    .0015984    .0329847
      status |  -.2929866   .1617566    -1.81   0.070   -.6100237    .0240505
       _cons |   -1.98381   .2755864    -7.20   0.000   -2.523949    -1.44367

. est store d

. lrtest d c

Likelihood-ratio test                  LR chi2(1)  =   3.35
(Assumption: c nested in d)            Prob > chi2 = 0.0673

Status, income, household size, sex, marital status, sticks smoked and the other variables did not improve the model when they were added one at a time. The final model is as follows:

. logit lung_cancer age_years month_smoking time_smoking sticks_consu, or

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood =  -696.7724
Iteration 2:   log likelihood = -696.12937
Iteration 3:   log likelihood = -696.12923
Iteration 4:   log likelihood = -696.12923

Logistic regression                    Number of obs =  1,250
                                       LR chi2(4)    =  86.33
                                       Prob > chi2   = 0.0000
Log likelihood = -696.12923            Pseudo R2     = 0.0584

  lung_cancer | Odds Ratio   Std. Err.       z    P>|z|   [95% Conf. Interval]
    age_years |   1.017153   .0081701     2.12   0.034    1.001265    1.033292
month_smoking |   1.010189   .0038516     2.66   0.008    1.002669    1.017767
 time_smoking |   1.036079   .0088738     4.14   0.000    1.018832    1.053618
 sticks_consu |   .9854168   .0083632    -1.73   0.083    .9691608    1.001946
        _cons |   .1002543   .0278405    -8.28   0.000    .0581735    .1727748

INTERPRETATION

• As age increases by one year, the odds of lung cancer increase on average by 1.7%, keeping other variables constant.
• As months of smoking increase by one unit, the odds of lung cancer increase on average by 1%, keeping other variables constant.
• As time of smoking increases by one unit, the odds of lung cancer increase on average by 3.6%, keeping other variables constant.

Determine the 95% confidence interval for coefficients of variables, check model adequacy and Interpret all the relevant outputs from each model you fitted above (Question 5)

. estat gof

Logistic model for lung_cancer, goodness-of-fit test

       number of observations =      1250
 number of covariate patterns =      1000
            Pearson chi2(991) =   1017.56
                  Prob > chi2 =    0.2722

The model fits adequately: the p-value (0.2722) is greater than 0.05, so there is no evidence of lack of fit.

Results

This study was conducted on 1250 respondents, of whom 604 were male and 646 were female. Of these participants, 639 were never married and 611 were married; 896 were censored and 354 had resumed smoking. By educational status, 258 (20.64%) did not complete high school, 356 (28.48%) completed high school, 264 (21.12%) had some college, 290 (23.2%) had a college degree and 82 (6.56%) had a post-undergraduate degree.

The association of lung cancer with the age category of the respondent, estimated by logistic regression, is as follows.

. logit lung_cancer ib(2).cat1age_years, or

Iteration 0:   log likelihood = -739.29478
Iteration 1:   log likelihood = -724.55172
Iteration 2:   log likelihood = -723.71048
Iteration 3:   log likelihood = -723.70486
Iteration 4:   log likelihood = -723.70486

Logistic regression                    Number of obs =  1,250
                                       LR chi2(2)    =  31.18
                                       Prob > chi2   = 0.0000
Log likelihood = -723.70486            Pseudo R2     = 0.0211

  lung_cancer | Odds Ratio   Std. Err.       z    P>|z|   [95% Conf. Interval]
cat1age_years |
        Adult |   4.347222   1.632185     3.91   0.000    2.082688    9.074015
        siner |   9.175781   4.152295     4.90   0.000    3.779643     22.2759
        _cons |    .091954   .0339725    -6.46   0.000    .0445752    .1896917

Interpretation:

The odds of lung cancer among seniors were 9.2 times the odds among the young, and the odds among adults were 4.3 times the odds among the young. Between adults and seniors, the occurrence of cancer was higher in seniors (>= 65 years) than in adults.

Model specification bias can arise when a relevant independent variable is omitted from the model. This is the main reason why the analysis first tests all variables related to the dependent variable before eliminating those that are not significant (Gupta, 2020).

The model was also examined against the OLS assumptions. The Ordinary Least Squares (OLS) method estimates the parameters of the regression model by minimizing the sum of squared errors (observed values minus predicted values). The first assumption is linearity: the dependent variable is linear in the parameters. The second is random sampling of the variables. The model was also tested for heteroscedasticity.

Conclusion

We found, as expected, that lung cancer and smoking are positively associated with each other. Age in years, time of smoking and months of smoking are positively associated with lung cancer in both the adjusted and crude analyses (Fig. 1). The number of sticks consumed is positively associated with lung cancer in the crude model (OR 1.02, p < 0.001) but is not significant after adjustment (OR 0.99, p = 0.083). For example, as time of smoking increases by one year, the odds of lung cancer increase by about 3.6%.
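The percentage changes in odds quoted in this section follow from the reported odds ratios as (OR - 1) x 100, and each likelihood-ratio statistic follows from the log likelihoods of the two nested models. The Python sketch below reproduces that arithmetic from the reported output; the function names are hypothetical, and the chi-square tail probability is written out only for the 1-degree-of-freedom case used in these comparisons.

```python
import math

def pct_change_in_odds(odds_ratio):
    """Percentage change in odds per one-unit increase in the covariate."""
    return (odds_ratio - 1.0) * 100.0

def lr_statistic(ll_restricted, ll_full):
    """Likelihood-ratio chi-square statistic for two nested models."""
    return 2.0 * (ll_full - ll_restricted)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Odds ratios from the final logistic model.
FINAL_OR = {
    "age_years": 1.017153,
    "month_smoking": 1.010189,
    "time_smoking": 1.036079,
    "sticks_consu": 0.9854168,
}
for name, odds_ratio in FINAL_OR.items():
    print(f"{name}: {pct_change_in_odds(odds_ratio):+.1f}% change in odds per unit")

# Nested comparisons using the reported final log likelihoods of models a, b and c.
lr_ba = lr_statistic(-730.79539, -702.02779)  # adding time_smoking
lr_cb = lr_statistic(-702.02779, -699.66669)  # adding age_years
print(f"b vs a: LR chi2(1) = {lr_ba:.2f}")
print(f"c vs b: LR chi2(1) = {lr_cb:.2f}, p = {chi2_sf_1df(lr_cb):.4f}")
```

A negative percentage, as for sticks_consu in the adjusted model, corresponds to an odds ratio below 1; its confidence interval includes 1, so it is not read as a protective effect.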
The same pattern holds in the adjusted model: a one-unit increase in time of smoking raises the odds of lung cancer by 3.6% on average, and one additional year of age raises them by 1.7% on average, both with p-values below 0.05, as the final logistic regression table shows. This is also similar to research done in many different countries of Africa.

Based on epidemiological knowledge, lung cancer is positively associated with the number of sticks consumed, even though the p-value for this variable in our adjusted model was greater than 0.05. Except for homoscedasticity, most of the linear regression assumptions were met.

The question to be answered was: what are the parameters that impact cigs? The project looked at the factors that could be affecting cigs, and some of these factors were established from the regression model. Four variables (months of smoking, age, time of smoking and household size) affected cigs significantly. These results can be used by the government and health care providers to find ways of moderating the number of cigarettes smoked in the country. The same results can be used by cigarette companies to produce and sell their cigarettes. Further research should involve more variables, because this study covered only a limited set of independent variables.

Bibliography

CDC. 2021. Cigarette Smoking and Tobacco Use Among People of Low Socioeconomic Status. https://www.cdc.gov/tobacco/disparities/low-ses/index.htm.

Gupta. 2020. "Specification Bias." https://rlacollege.edu.in/pdf/Statistics/specification-bias.pdf.

Julian Perelman, Joana Alves, Timo‐Kolja Pfoertner, Irene Moor, Bruno Federico, Mirte A. G. Kuipers, Matthias Richter, Arja Rimpela, Anton E. Kunst, and Vincent Lorant. 2017. December. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5698771/.

Mullahy. 1997. "Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior." Review of Economics and Statistics 79, 586-593.