Linear correlation and linear regression + summary of tests
Week 12

Regression Theory

Recall: Covariance

cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Interpreting Covariance
• cov(X,Y) > 0: X and Y are positively correlated
• cov(X,Y) < 0: X and Y are inversely correlated
• cov(X,Y) = 0: X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not by itself imply independence)

Correlation coefficient
Pearson's correlation coefficient is standardized covariance (unitless):

r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}

Recall the dice problem, where x = the roll of the first die and y = the sum of the two dice (a numeric check appears at the end of this section):
• Var(x) = 2.9167
• Var(y) = 5.8333
• Cov(x, y) = 2.9167

R² = "coefficient of determination" = SS_explained / TSS

r = \frac{2.9167}{\sqrt{2.9167 \times 5.8333}} = \frac{1}{\sqrt{2}} = .707

R^2 = .707^2 = .5

Interpretation of R²: 50% of the total variation in the sum of the two dice is explained by the roll of the first die. Makes perfect intuitive sense!

Correlation
• Measures the relative strength of the linear relationship between two variables
• Unitless
• Ranges between -1 and +1
• The closer to -1, the stronger the negative linear relationship
• The closer to +1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship (in either direction)

Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots illustrating r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Linear Correlation
[Figure: scatter plots contrasting linear relationships with curvilinear relationships]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Linear Correlation
[Figure: scatter plots contrasting strong relationships with weak relationships]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Linear Correlation
[Figure: scatter plots showing no relationship]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).

What is "Linear"?
• Remember this: Y = mX + B? Here m is the slope and B is the Y-intercept.

What's Slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Simple linear regression
The linear regression model:
Love of Math = 5 + .01*(math SAT score)
where 5 is the intercept and .01 is the slope (P = .22; not significant).

Prediction
If you know something about X, this knowledge helps you predict something about Y. (Sound familiar? Sounds like conditional probabilities.)

Linear Regression Model
The Y's are modeled:
Y_i = 100*X + random error_i
The fixed part (100*X) falls exactly on the line; the random error follows a normal distribution.

Assumptions (or the fine print)
• Linear regression assumes that…
– 1. The relationship between X and Y is linear
– 2. Y is distributed normally at each value of X
– 3. The variance of Y at every value of X is the same (homogeneity of variances)
• Why? The math requires it: the fitting process is called "least squares" because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).

Expected value of y at a given level of x = x_i:

\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i

Residual:

e_i = y_i - \hat{y}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)

We fit the regression coefficients such that the sum of the squared residuals is minimized (least-squares regression).
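A quick way to verify the dice numbers above is to enumerate all 36 equally likely outcomes. A minimal sketch in plain Python (since the 36 outcomes form the full sample space, it divides by n rather than n - 1):

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
x = [a for a, b in outcomes]       # roll of the first die
y = [a + b for a, b in outcomes]   # sum of the two dice

n = len(outcomes)                  # 36: the full sample space,
mx, my = sum(x) / n, sum(y) / n    # so divide by n, not n - 1

var_x = sum((xi - mx) ** 2 for xi in x) / n                      # 2.9167
var_y = sum((yi - my) ** 2 for yi in y) / n                      # 5.8333
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n  # 2.9167

r = cov_xy / (var_x * var_y) ** 0.5
print(round(r, 3), round(r ** 2, 2))  # 0.707 0.5
```

Cov(x, y) equals Var(x) here because y = x + (second die) and the two dice are independent, which is exactly why R² = Var(x)/Var(y) = .5.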
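The least-squares coefficients also have a simple closed form: \hat{\beta} = cov(x, y)/var(x) and \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}. A minimal sketch in the spirit of the love-of-math example above (the SAT scores and ratings are hypothetical, chosen only to illustrate the fit and the residuals):

```python
def least_squares_fit(x, y):
    """Simple linear regression: minimizes the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # beta-hat = cov(x, y) / var(x); the (n - 1) factors cancel.
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    alpha = my - beta * mx  # the fitted line passes through (x-bar, y-bar)
    return alpha, beta

# Hypothetical data: math SAT score vs. a 1-10 "love of math" rating.
sat  = [450, 500, 550, 600, 650, 700, 750]
love = [4.8, 5.6, 5.2, 6.3, 6.1, 7.0, 6.9]

a, b = least_squares_fit(sat, love)
residuals = [yi - (a + b * xi) for xi, yi in zip(sat, love)]  # e_i = y_i - y-hat_i
print(f"love of math = {a:.2f} + {b:.4f} * (math SAT score)")
print("residuals:", [round(e, 2) for e in residuals])
```

By construction the residuals sum to (essentially) zero; what matters for checking the assumptions is their pattern, which is the subject of the next section.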
Residual
Residual = observed value - predicted value
[Figure: scatter plot with fitted line; one observation compared with its predicted value of 33.5 weeks, the residual being the vertical distance between them]

Residual Analysis: check assumptions

e_i = Y_i - \hat{Y}_i

• The residual for observation i, e_i, is the difference between its observed and predicted value
• Check the assumptions of regression by examining the residuals:
– Examine for the linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)
– Evaluate the normal-distribution assumption
– Evaluate the independence assumption
• Graphical analysis of residuals
– Can plot residuals vs. X

Residual Analysis for Linearity
[Figure: data and residual plots for a non-linear relationship (curved residual pattern) versus a linear one (patternless residuals)]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Residual Analysis for Homoscedasticity
[Figure: residual plots contrasting non-constant variance (fan-shaped residuals) with constant variance]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

Residual Analysis for Independence
[Figure: residual plots contrasting non-independent (patterned) residuals with independent ones]
(Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall)

As a linear regression…
The intercept represents the mean value in the even-day group; it is significantly different from 0, so the average English SAT score is not 0. The slope (OddDay) represents the difference in means between the odd-day and even-day groups; the difference is significant.

Parameter    Estimate      Standard Error   t Value   Pr > |t|
Intercept    657.5000000   23.66105065      27.79     <.0001
OddDay        81.7307692   32.81197359       2.49     0.0204

Multiple Linear Regression
More than one predictor:

\hat{y} = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.

ANOVA is linear regression!
A categorical variable with more than two groups, e.g. groups 1, 2, and 3 (mutually exclusive):

\hat{y} = \alpha\,(=\text{value for group 1}) + \beta_1 (1 \text{ if in group 2}) + \beta_2 (1 \text{ if in group 3})

This is called "dummy coding": multiple binary variables are created to represent being in each category (or not) of a categorical variable. (A numeric sketch of dummy coding appears after the summary table below.)

Other types of multivariate regression
• Multiple linear regression is for normally distributed outcomes
• Logistic regression is for binary outcomes
• Cox proportional hazards regression is used when time-to-event is the outcome

Overview of statistical tests
The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.
E.g., blood pressure = pounds + age + treatment (1/0): a continuous outcome modeled with continuous predictors (pounds, age) and a binary predictor (treatment).

Alternative summary: statistics for various types of outcome data

| Outcome variable | Independent observations | Correlated observations | Assumptions |
|---|---|---|---|
| Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship |
| Binary or categorical (e.g. fracture yes/no) | Difference in proportions; Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Chi-square test assumes sufficient numbers in each cell (>=5) |
| Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups |
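To see that dummy coding really does reproduce an ANOVA-style comparison of group means, here is a small sketch (the numbers are hypothetical, and NumPy's generic least-squares solver stands in for the regression fit):

```python
import numpy as np

# Hypothetical outcome values for three mutually exclusive groups.
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y     = np.array([10., 12., 11., 15., 16., 14., 20., 19., 21.])

# Dummy coding: group 1 is the reference; one 0/1 column per remaining group.
X = np.column_stack([
    np.ones_like(y),             # intercept -> mean of group 1
    (group == 2).astype(float),  # beta1 -> mean(group 2) - mean(group 1)
    (group == 3).astype(float),  # beta2 -> mean(group 3) - mean(group 1)
])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [11. 4. 9.]
print([y[group == g].mean() for g in (1, 2, 3)])  # [11.0, 15.0, 20.0]
```

The intercept recovers the group-1 mean, and each beta is the difference between another group's mean and the reference mean, exactly as in the dummy-coding equation above.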
Continuous outcome (means)
Are the observations independent or correlated?

Outcome variable: Continuous (e.g. pain scale, cognitive function)

Independent observations:
• Ttest: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
• Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired ttest
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the ttest
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

Binary or categorical outcomes (proportions)
Are the observations correlated?

Outcome variable: Binary or categorical (e.g. fracture, yes/no)

Independent observations:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
• McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
• Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
• McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)

Time-to-event outcome (survival data)
Are the observation groups independent or correlated?

Outcome variable: Time-to-event (e.g., time to fracture)

Independent observations:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations:
• n/a (already over time)

Modifications to Cox regression if proportional hazards is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)
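To make the continuous-outcome choices above concrete: with a small, skewed sample one might run the usual test alongside its non-parametric alternative. A sketch assuming SciPy is available (all data hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical skewed outcome (e.g., a pain scale) in two independent groups.
group_a = rng.exponential(scale=2.0, size=15)
group_b = rng.exponential(scale=3.5, size=15)

t, p_t = stats.ttest_ind(group_a, group_b)     # compares means
u, p_u = stats.mannwhitneyu(group_a, group_b)  # non-parametric alternative
print(f"ttest p = {p_t:.3f}   Mann-Whitney U p = {p_u:.3f}")

# Two continuous variables: Pearson vs. its non-parametric alternative.
x = rng.normal(size=30)
y = x + rng.normal(scale=0.5, size=30)
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f} (p = {p_r:.3g})   Spearman rho = {rho:.2f} (p = {p_rho:.3g})")
```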
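And for the time-to-event row: a sketch of Kaplan-Meier estimation and the log-rank comparison, assuming the third-party lifelines package is installed (follow-up times and event indicators are hypothetical):

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Hypothetical follow-up times (months); event = 1 means fracture, 0 means censored.
t_a = [6, 7, 10, 15, 19, 25, 30, 34, 40, 46]
e_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
t_b = [3, 4, 5, 8, 9, 12, 14, 18, 22, 27]
e_b = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Kaplan-Meier: estimates the survival function for one group.
kmf = KaplanMeierFitter()
kmf.fit(t_a, event_observed=e_a, label="group A")
print(kmf.survival_function_.tail(3))

# Log-rank test: compares the two groups' survival curves.
result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(f"log-rank p = {result.p_value:.3f}")
```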