OLS & Logistic Regression Analysis – A Recap
Cristina Penaloza & Eoin Maloney
Health Economics Unit

Outline
• What is regression analysis?
• Relevance of regression analysis
• Regression modelling process
  – OLS regression
  – Logistic regression
• Exercise

What is Regression Analysis?
“Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, … with a view to estimating and/or predicting the (population) mean or average value of the dependent variable in terms of known or fixed (in repeated sampling) values of the explanatory variables.” Gujarati (1995: 16)

Terminology
Dependent variable, explained variable, outcome variable, outcome, response variable, regressand, output variable, predicted value, predictand, endogenous
Explanatory variable, independent variable, predictor variable, predictor, regressor, stimulus/control variable, exogenous
Disturbance (random error) term, residual, residual error

Causation / correlation
• Regression vs causation
  – “A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics” Gujarati (1995: 20)
• Regression vs correlation
  – Correlation analysis: seeks to measure the strength of linear association between two variables
  – Regression analysis: seeks to estimate or predict the average value of one variable on the basis of fixed values of other variables

Why study regression?
• Adjusting for baseline characteristics in economic evaluation (Nathwani et al. 2004; Manca et al. 2005; Hoch et al. 2002)
• Predicting/mapping utility-based outcome measures for use in economic evaluation (Gray et al. 2006; Kaambwa et al. 2011; Sengupta et al. 2004)
• Predicting costs for use in economic evaluation (Smith et al. 2007; Bonizzato et al. 2000; Baumeister et al. 2009)
• Constructing cost-effectiveness acceptability curves (CEACs) (Hoch et al.
2006)
• Regression imputation for missing data (Billingham et al. 2002; Engels & Diehr 2003; Blazer et al. 1995)
• Explaining factors which cause variation in outcome and cost data (Barber & Thompson 2004; Kaambwa et al. 2008; Raine et al. 2010)

The regression modelling process
1. Statement of hypothesis (theory)
2. Specification of the model
3. Obtaining the data
4. Estimation of the regression model
5. Diagnostic analysis
6. Hypothesis testing
7. Prediction/forecasting

1. Statement of hypothesis
Example: high blood pressure and older people
“Amongst those over the age of 65, the incidence of high diastolic blood pressure (DIBP) increases with age. Therefore, DIBP is, in part, explained by age.”

2. Specification of the model
In functional form: mean diastolic blood pressure, DIBP, is some function of age, A:
$DIBP = f(A)$   (1)

2. Specification of the model (cntd)
In mathematical (linear) form:
$Y = \beta_1 + \beta_2 X$   (2)
where
Y = mean DIBP
X = age
$\beta_1$ and $\beta_2$ = parameters

Linear relationship
[Figure: observations scattered around the line E(Y|X), plotted against X]

2. Specification of the model (cntd)
Econometric (regression) model:
$Y = \beta_1 + \beta_2 X + u$   (3)
where
Y = mean DIBP, the dependent variable
X = age, the explanatory variable
u = disturbance (random error) term
$\beta_1$ and $\beta_2$ = parameters

The error term (u)
• Omitted explanatory variables
• Measurement error
• Wrong functional form
• Unavailability of data
• Inherent randomness, etc.

3 & 4. Data / estimation of parameters
• Obtaining the data
  – observed values of Y and X
• Estimation of the parameters
  – Y and X are the variables (“known”)
  – $\beta_1$ and $\beta_2$ (the parameters) and $u$ (the disturbance) are “unknown”

5. Diagnostic analysis
• Is the model correctly specified?
• Have all assumptions been met?
• Are there any unusual observations or outliers that may unduly influence results?
More of this later this morning…

6. Hypothesis testing
• Is the estimate statistically close to a postulated value? Or are the estimates in accord with expectations from theory?
• Only after the model has been shown to be adequate

7. Forecasting or prediction
• If the hypothesis or theory being tested is confirmed, then future values of the dependent variable can be predicted or forecast
• Policy recommendations

The practice of regression modelling
Hypothesis / theory → Model specification → Data → Estimation → Specification testing and diagnostic testing → Is the model adequate?
  – Yes: Hypothesis testing → Policy: prediction and forecasting
  – No: go back and revise the earlier steps

Sample regression
• In practice we will never observe the population regression line.
• Instead we take a random sample of observations in order to estimate the βs.
• We distinguish the sample regression from the population regression as follows:

Sample regression
Mathematical model: $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$
Econometric model: $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$
where
$\hat{Y}_i$ = estimator of $E(Y \mid X_i)$
$\hat{\beta}_1$ = estimator of $\beta_1$
$\hat{\beta}_2$ = estimator of $\beta_2$
$\hat{u}_i$ = estimate of $u_i$

Population regression
Mathematical model: $E(Y \mid X_i) = \beta_1 + \beta_2 X_i$
Econometric model: $Y_i = \beta_1 + \beta_2 X_i + u_i$
where
$\beta_1$ = constant / Y intercept
$\beta_2$ = coefficient for $X_i$
$u_i$ = error term

[Figure: sample regression line $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$, with observed values Y1 to Y4 and fitted values Ŷ1 to Ŷ4 at X1 to X4]
[Figure: the same sample regression line with the residuals û1 to û4 marked between the observed and fitted values]

The Ordinary Least Squares (OLS) Model
The dependent variable is modelled as a linear function of predictor or independent variables. The dependent variable is continuous, e.g. blood pressure, cholesterol level or weight.

OLS
• What factors cause variation in an individual’s diastolic blood pressure?
• What variables explain movement in men’s cholesterol level?
• What variables are predictive of high birth weight in a population of mothers from Birmingham?
The dependent variable can take on any numerical value within the limits of the range of that variable.

OLS
The OLS method seeks to minimise the residual sum of squares:
$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$

Minimising the residual…
[Figure: residuals û1 to û4 around the fitted line at X1 to X4]

Describing the overall fit of the estimated model
The coefficient of determination, or $R^2$, is a measure of the ‘goodness of fit’ of a regression, i.e. the proportion of the variation in $Y_i$ which is explained by the regression:
$0 \le R^2 \le 1$
But focusing solely on maximising $R^2$ is not a good idea! (Other measures will be considered this afternoon…)

Models for Categorical Dependent Variables
For use on dependent variables that are either dichotomous (individual has CVD or not) or polytomous (low, medium or high cholesterol level), which are quite common in health-related datasets.

Models for Categorical Dependent Variables
Focus: binary response variable. Independent variables are used to predict whether or not some event will occur. Based on certain described characteristics:
• Will an individual get cancer or not?
• Will a patient survive or die?
• Will an individual develop CVD or not?

Coding of outcomes: usually coded 1 if the attribute of interest is present and 0 otherwise.
Approach to be used: logistic regression, best for a dichotomous dependent variable with continuous and categorical independent variables.
Other commonly used approaches: probit and nested logit

Major difference from Ordinary Linear Regression
• Uses a link function for the relationship between the dependent and independent variables
• Substitutes maximum likelihood estimation (MLE) of a link function of the dependent variable for regression's use of least squares estimation of the dependent variable itself.
MLE: a method of estimating unknown parameters in such a way that the probability of observing the given values of the dependent variable is as high as possible.

Issues to consider…
• Why are OLS models not suitable for dichotomous data?
• Logit transformation – link function
• Marginal & conditional odds and probability

Suppose we want to model
$Y_i = \beta_0 + \beta_1 X_1 + \varepsilon$
but
$Y_i = \begin{cases} 1 & \text{if the } i\text{-th individual has the attribute of interest, e.g. CVD} \\ 0 & \text{otherwise} \end{cases}$
and
• $\beta_0$ is the coefficient on the constant term,
• $\beta_1$ is the coefficient on the independent variable,
• $X_1$ is the independent variable, e.g. age, and
• $\varepsilon$ is the error term.

Let $Y_i = 1$ if the i-th individual has CVD, and 0 otherwise. Let $Y_i$ also take the values 1 and 0 with probabilities $p_i$ and $1 - p_i$, respectively, i.e.
$P(Y_i = 1) = P(CVD = 1) = p_i$
$P(Y_i = 0) = P(CVD = 0) = 1 - p_i$

Why not just use simple linear (OLS) regression?
Consider a simple OLS regression model
$CVD = \beta_0 + \beta_1 Age + \varepsilon$
Assumptions:
a) $\varepsilon \sim N(0, \sigma^2)$
b) $var(\varepsilon)$ is constant, i.e. homoscedasticity
Binary outcome variables violate these assumptions…

Why not just use simple linear (OLS) regression?
• CVD is binary: it takes on only two values. Consequently, for any given value of Age, ε can also take only two values, so the ‘normality of residuals’ assumption is violated.
• The error terms are heteroscedastic (their variance, $p_i(1 - p_i)$, changes with $p_i$), so the regression assumption that the variance of the error term is constant is violated.
• The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis!

Logit transformation
1. Move from probabilities to odds:
$Odds = \frac{P_i}{1 - P_i} = \frac{P(\text{CVD exists})}{P(\text{CVD doesn't exist})}$
2. Take logs of both sides, to get the log-odds or logit:
$\log(odds) = logit(P_i) = \log\left(\frac{P_i}{1 - P_i}\right) = \beta_i Age_i$
or equivalently,
$P_i(\text{CVD exists}) = \frac{\exp(\beta_i Age_i)}{1 + \exp(\beta_i Age_i)}$

The logit transformation removes the floor restriction: odds cannot fall below 0, but log-odds can take any real value.

Logistic Regression Output
Part of this output is in the form of odds, odds ratios and probabilities. An understanding of these concepts (both marginal and conditional) is therefore cardinal to interpreting logistic regression output.
Key question to be explored: what factors determine the probability that an individual will or will not develop CVD?

Marginal & conditional odds

               CVD   No CVD   Total
Smokers         75       25     100
Non-smokers     40       60     100
Total          115       85     200

• The odds of having CVD are 115/85 = 1.353. This is the marginal or unconditional odds of having CVD.
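The logit transformation and the marginal odds quoted above can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the function names are illustrative, and the only data used are the counts from the table (115 individuals with CVD, 85 without).

```python
import math


def odds(p):
    """Convert a probability p (0 <= p < 1) to odds p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a log-odds value z back to a probability: exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1.0 + math.exp(z))


# Counts taken from the table above: 115 of 200 individuals have CVD.
with_cvd, without_cvd = 115, 85
p_cvd = with_cvd / (with_cvd + without_cvd)  # marginal probability of CVD
marginal_odds = odds(p_cvd)                  # equals 115 / 85

print(round(marginal_odds, 3))                # marginal odds of CVD
print(round(logit(p_cvd), 3))                 # the same quantity on the log-odds scale
print(round(inverse_logit(logit(p_cvd)), 3))  # round trip back to the probability
```

Working on the log-odds scale is what lets a linear predictor take any real value while the implied probability stays between 0 and 1.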
The conditional odds of having CVD, given “smokers”, are 75:25, or 3. A smoker is 3.0 times as likely to have CVD as not to have it.
The conditional odds of having CVD, given “non-smokers”, are 40:60, or 0.67. A non-smoker is 0.67 times as likely to have CVD as not to have it.

Probability
The probability of having CVD is 115/200 = 0.575
The probability of having CVD given that one is a smoker is 75/100 = 0.75
The probability of having CVD given that one is a non-smoker is 40/100 = 0.40

Odds Ratio
The odds ratio of smokers (numerator) to non-smokers (denominator) for CVD is 3/(40/60) = 4.5 (dividing by the rounded value 0.67 instead gives 4.478). This means that the odds of smokers having CVD are 4.5 times as high as those of non-smokers having CVD.
The odds ratio is the cross-product ratio, i.e.
$\frac{60 \times 75}{40 \times 25} = 4.5$
When one moves from being a non-smoker to a smoker, the odds of having CVD increase by 350% (i.e. from odds of 0.67 for non-smokers to 3 for smokers).

Alternative interpretation of Odds Ratio
• Loosely, smokers are 4.5 times more likely to have CVD than non-smokers. Strictly, this statement refers to odds: the ratio of the two probabilities is 0.75/0.40 = 1.875.
• Loosely, the risk of having CVD is 4.5 times greater for smokers than for non-smokers (the same caveat applies).
• The odds of CVD for smokers are 350% higher than the odds of CVD for non-smokers (4.5 − 1.00)
• The predicted odds for smokers are 4.5 times the odds for non-smokers.
• A one-unit change in the independent variable Smoker (from non-smoker to smoker) increases the odds of having CVD by a factor of 4.5.

References
• Altman D.G. 1991. Practical Statistics for Medical Research (London: Chapman & Hall/CRC)
• Gujarati D.N. 1995. Basic Econometrics (New York: McGraw-Hill, Inc)
• Johnston J. and J. DiNardo. 1997. Econometric Methods (London: The McGraw-Hill Companies, Inc)
• Long J.S. 1997. Regression Models for Categorical and Limited Dependent Variables. A Volume in the Sage Series for Advanced Quantitative Techniques (Thousand Oaks, CA: Sage Publications)
• Wang, MinQi, James M. Eddy, Eugene C. Fitzhugh. 1995. “Application of Odds Ratio and Logistic Models in Epidemiology and Health Research”, Health Values 19: 59-62
• Nathwani et al. 2004. “An economic evaluation of a European cohort from a multinational trial of linezolid versus teicoplanin in serious Gram-positive bacterial infections: the importance of treatment setting in evaluating treatment effects”, International Journal of Antimicrobial Agents 23: 315-324
• Manca A, Hawkins N, Sculpher M. 2005. “Estimating mean QALYs in trial-based cost-effectiveness analysis: the importance of controlling for baseline utility”, Health Economics 14: 487-496
• Hoch et al. 2002. “Something old, something new, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis”, Health Econ 11: 415-430
• Gray et al. 2006. “Estimating the association between SF-12 responses and EQ-5D utility values by response mapping”, Med Decis Making 26, 1: 18-29
• Kaambwa et al. 2011. “Mapping utility scores from the Barthel index”, Eur. Journal of Health Economics, DOI: 10.1007/s10198-011-0364-5
• Sengupta et al. 2004. “Mapping the SF-12 to the HUI3 and VAS in a managed care population”, Med Care 42, 9: 927-937
• Smith et al. 2007. “Predicting Costs of Care in Chronic Kidney Disease: The Role of Comorbid Conditions”, The Internet Journal of Nephrology 4, 1
• Bonizzato et al. 2000. “Community-based mental health care: to what extent are service costs associated with clinical, social and service history variables?”, Psychological Medicine 30: 1205-1215
• Baumeister et al. 2009. “Predictive modeling of health care costs: do cardiovascular risk markers improve prediction?”, European Journal of Cardiovascular Prevention & Rehabilitation
• Hoch et al.
2006. “Using the net benefit regression framework to construct cost-effectiveness acceptability curves: an example using data from a trial of external loop recorders versus Holter monitoring for ambulatory monitoring of "community acquired" syncope”, BMC Health Services Research 6: 68
• Billingham LJ et al. 2002. “Patterns, costs and cost-effectiveness of care in a trial of chemotherapy for advanced non-small cell lung cancer: evidence from a randomised trial”, Lung Cancer 37: 219-225
• Engels, J.M. & Diehr, P. 2003. “Imputation of missing longitudinal data: a comparison of methods”, Journal of Clinical Epidemiology 56: 968-976
• Blazer et al. 1995. “Health Services Access and Use among Older Adults in North Carolina: Urban vs Rural Residents”, American Journal of Public Health 85, 10: 1384-1390
• Barber, J. & Thompson, S. 2004. “Multiple regression of cost data: use of generalised linear models”, J Health Serv Res Policy 9: 197-204
• Kaambwa, B., Bryan, S., Barton, P., Parker, H., Martin, G., Hewitt, G., Parker, S., & Wilson, A. 2008. “Costs and health outcomes of intermediate care: results from five UK case study sites”, Health Soc. Care Community 16: 573-581
• Raine et al. 2010. “Social variations in access to hospital care for patients with colorectal, breast, and lung cancer between 1999 and 2006: retrospective analysis of hospital episode statistics”, BMJ 340: b5479

Exercises
• OLS regression
• Logistic regression
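As a possible warm-up for the exercises, the two calculations at the heart of this recap can be sketched in plain Python: fitting a one-variable OLS line by minimising the residual sum of squares, and computing an odds ratio as a cross-product ratio. Everything here is illustrative rather than taken from the course material: the function names are hypothetical, and the age/DIBP values are made-up numbers chosen only to show the mechanics. The 2×2 counts are those of the smokers/CVD table, for which the exact cross-product ratio is 4.5 (dividing the rounded odds, 3/0.67, gives 4.478).

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx                # estimate of beta_2
    intercept = y_bar - slope * x_bar  # estimate of beta_1
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared residuals (Y_i - fitted Y_i)^2."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def r_squared(x, y, intercept, slope):
    """Proportion of the variation in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - residual_sum_of_squares(x, y, intercept, slope) / total_ss


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product ratio of a 2x2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical data (NOT from the slides): age in years and mean DIBP in mmHg.
ages = [66, 70, 74, 78, 82]
dibp = [82, 85, 87, 91, 93]

intercept, slope = fit_simple_ols(ages, dibp)
rss = residual_sum_of_squares(ages, dibp, intercept, slope)
r2 = r_squared(ages, dibp, intercept, slope)

# Counts from the smokers/CVD table: smokers 75 vs 25, non-smokers 40 vs 60.
cvd_odds_ratio = odds_ratio(75, 25, 40, 60)

print(f"intercept={intercept:.3f} slope={slope:.3f} RSS={rss:.3f} R2={r2:.3f}")
print(f"odds ratio={cvd_odds_ratio:.3f}")
```

In practice a statistics package would be used for both models; the closed-form version is shown only because it makes the least squares criterion explicit.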