Linear Regression 2
Sociology 5811 Lecture 21
Copyright © 2005 by Evan Schofer. Do not copy or distribute without permission.

Announcements
• Proposals Due Today

Review: Regression
• Regression coefficient formulas:

  b = \frac{s_{YX}}{s_X^2} \qquad a = \bar{Y} - b\bar{X}

• Question: What is the interpretation of a regression slope?
• Answer: It indicates the typical increase in Y for any 1-point increase in the X-variable
  – Note: this information is less useful if the linear association between X and Y is low

Example: Education & Job Prestige
• The actual SPSS regression results for that data:

  Model Summary: R = .521, R Square = .272, Adjusted R Square = .271, Std. Error of the Estimate = 12.40
  Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

  Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE):
                                       B      Std. Error   Beta   t       Sig.
  (Constant)                           9.427  1.418               6.648   .000
  HIGHEST YEAR OF SCHOOL COMPLETED     2.487  .108         .521   23.102  .000

• Estimates of a and b: “Constant” = a = 9.427; slope for “Year of School” = b = 2.487
• Equation: Prestige = 9.4 + 2.5(Education)
• A year of education adds 2.5 points of job prestige

Review: Covariance
• Covariance (s_YX): the sum of deviations around Y-bar multiplied by deviations around X-bar:

  s_{YX} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{N-1}

• Measures whether deviation (from the mean) in X tends to be accompanied by similar deviation in Y
  – Or whether cases with positive deviation in X have negative deviation in Y
  – This is summed up for all cases in the data

Review: Covariance
• Covariance is based on multiplying the deviations in X and Y
  [Scatter plot with X-bar = −1, Y-bar = .5: a point that deviates a lot from both means (dev = 3, dev = 2.5) contributes (3)(2.5) = 7.5; a point that deviates very little from X-bar and Y-bar contributes almost nothing: (.4)(−.25) = −.1]

Review: Covariance and Slope
• The slope formula can be written out as follows:

  b_{YX} = \frac{s_{YX}}{s_X^2}
         = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})/(N-1)}{\sum_{i=1}^{N}(X_i - \bar{X})^2/(N-1)}
         = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}

Review: R-Square
• The R-Square statistic indicates how well the regression line “explains” variation in Y
• It is based on partitioning variance into:
• 1. Explained (“regression”) variance
  – The portion of deviation from Y-bar accounted for by the regression line
• 2. Unexplained (“error”) variance
  – The portion of deviation from Y-bar that is “error”
• Formula:

  R^2_{YX} = \frac{s^2_{\hat{Y}}}{s^2_Y} = \frac{SS_{REGRESSION}}{SS_{TOTAL}}

Review: R-Square
• Visually: deviation is partitioned into two parts
  [Scatter plot with regression line Y = 2 + .5X: the gap between each point and the line is “error variance”; the gap between the line and Y-bar is “explained variance”]

Correlation Coefficient (r)
• The R-square is very similar to another important statistic: the correlation coefficient (r)
  – R-square is literally the square of r
• Formula for the correlation coefficient:

  r = \frac{s_{YX}}{s_X s_Y}

• r is a measure of linear association
• Ranges from −1 to 1
• Zero indicates no linear association
• 1 = perfect positive linear association
• −1 = perfect negative linear association

Correlation Coefficient (r)
• Example: Education and Job Prestige
• SPSS can calculate the correlation coefficient
  – Usually listed in a matrix to allow many comparisons

  Correlations (Pearson):
                                       HIGHEST YEAR OF     RS OCCUPATIONAL
                                       SCHOOL COMPLETED    PRESTIGE SCORE
  HIGHEST YEAR OF SCHOOL COMPLETED     1.000               .521**
    Sig. (2-tailed), N                 –, 1530             .000, 1434
  RS OCCUPATIONAL PRESTIGE SCORE       .521**              1.000
    Sig. (2-tailed), N                 .000, 1434          –, 1440
  **. Correlation is significant at the 0.01 level (2-tailed).

• Correlation of “Year of School” and Job Prestige: r = .521
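To make the review formulas above concrete, here is a minimal sketch in Python (assuming NumPy is available). The education/prestige numbers are invented for illustration; they are not the GSS data behind the SPSS output above:

```python
import numpy as np

# Invented data: years of education (X) and job prestige scores (Y)
X = np.array([8.0, 10, 12, 12, 14, 16, 16, 18, 20])
Y = np.array([25.0, 30, 38, 42, 45, 50, 55, 60, 68])
N = len(X)

x_dev = X - X.mean()                     # deviations around X-bar
y_dev = Y - Y.mean()                     # deviations around Y-bar

s_yx = np.sum(x_dev * y_dev) / (N - 1)   # covariance s_YX
s2_x = np.sum(x_dev ** 2) / (N - 1)      # variance of X
s2_y = np.sum(y_dev ** 2) / (N - 1)      # variance of Y

b = s_yx / s2_x                          # slope: b = s_YX / s_X^2
a = Y.mean() - b * X.mean()              # constant: a = Y-bar - b * X-bar
r = s_yx / np.sqrt(s2_x * s2_y)          # correlation coefficient

# R-square from the partition of sums of squares
Y_hat = a + b * X
ss_total = np.sum((Y - Y.mean()) ** 2)
ss_regression = np.sum((Y_hat - Y.mean()) ** 2)
r_square = ss_regression / ss_total

print(f"b = {b:.3f}, a = {a:.3f}, r = {r:.3f}, R^2 = {r_square:.3f}")
print(f"r squared = {r ** 2:.3f}  (matches R^2)")
```

The last line illustrates the point from the slides: R-square is literally the square of r.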
Covariance, R-square, r, and b
• Covariance, R-square, r, and b are all similar
  – All provide information about the relationship between X and Y
• Differences:
• Covariance, b, and r can be positive or negative
  – r is scaled from −1 to +1; the others range widely
• b tells you the actual slope
  – It relates change in X to change in Y in real units
• R-square is like r, but is never negative
  – And, it tells you the “explained” variance of a regression

Correlation Hypothesis Tests
• Hypothesis tests can be done on r, R-square, and b
• Example: Correlation (r): linear association
• Is an observed positive or negative correlation significantly different from zero?
  – Might the population have no linear association?
  – The population correlation is denoted by the Greek letter rho (ρ)
• H0: There is no linear association (ρ = 0)
• H1: There is linear association (ρ ≠ 0)
• We’ll mainly focus on tests regarding slopes
• But the process is similar for correlation (r)

Correlation Coefficient (r)
• Education and Job Prestige hypothesis test:
  [Same SPSS correlation matrix as above: r = .521**, Sig. (2-tailed) = .000, N = 1434]
• Here, the asterisks signify that coefficients are significantly different from zero at α = .01
• “Sig.” is a p-value: the probability of observing r if ρ = 0. Compare it to α!

Hypothesis Tests: Slopes
• Given: Observed slope relating Education to Job Prestige = 2.487
• Question: Can we generalize this to the population of all Americans?
  – How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: Conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: Population slope β = 0
• H1: Population slope β ≠ 0 (two-tailed test)

Example: Slope Hypothesis Test
• The actual SPSS regression results for that data:
  [Same SPSS Model Summary and Coefficients tables as above; the t-value and “Sig.” (p-value) on the HIGHEST YEAR OF SCHOOL COMPLETED row (t = 23.102, Sig. = .000) are for hypothesis tests about the slope]
• Reject H0 if: t-value > critical t (N−2 df)
• Or, if “Sig.” (p-value) is less than α
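Both tests can be sketched in a few lines of Python, assuming SciPy is available and reusing the invented data from the earlier sketch. scipy.stats.pearsonr reports the two-tailed p-value for H0: ρ = 0, and scipy.stats.linregress reports the two-tailed p-value for H0: β = 0:

```python
import numpy as np
from scipy import stats

# Invented education (X) and prestige (Y) data, as before
X = np.array([8.0, 10, 12, 12, 14, 16, 16, 18, 20])
Y = np.array([25.0, 30, 38, 42, 45, 50, 55, 60, 68])

# Correlation test: H0: rho = 0 (two-tailed)
r, p_corr = stats.pearsonr(X, Y)
print(f"r = {r:.3f}, p = {p_corr:.4f}")

# Slope test: H0: beta = 0 (two-tailed, N - 2 df)
fit = stats.linregress(X, Y)
t_value = fit.slope / fit.stderr
print(f"b = {fit.slope:.3f}, s_b = {fit.stderr:.3f}, "
      f"t = {t_value:.3f}, p = {fit.pvalue:.4f}")

# Decision rule: reject H0 when the p-value ("Sig.") is below alpha
alpha = 0.05
print("Reject H0" if fit.pvalue < alpha else "Fail to reject H0")
```

The p-value printed here plays the same role as the “Sig.” column in the SPSS output: compare it to α.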
Hypothesis Tests: Slopes
• What information lets us do a hypothesis test?
• Answer: Estimates of a slope (b) have a sampling distribution, like any other statistic
  – It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
  – Thus, we can assess the probability that a given value of b would be observed, if β = 0
  – If that probability is low – below alpha – we reject H0

Hypothesis Tests: Slopes
• Visually: If the population slope (β) is zero, then the sampling distribution would center at zero
  – Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
  [Figure: the sampling distribution of the slope b, centered at 0. If β = 0, observed slopes should commonly fall near zero, too. If the observed slope falls very far from 0, it is improbable that β is really equal to zero; thus, we can reject H0]

Bivariate Regression Assumptions
• Assumptions for bivariate regression hypothesis tests:
• 1. Random sample
  – Ideally N > 20
  – But different rules of thumb exist (10, 30, etc.)
• 2. Variables are linearly related
  – i.e., the mean of Y increases linearly with X
  – Check the scatter plot for a general linear trend
  – Watch out for non-linear relationships (e.g., U-shaped)

Bivariate Regression Assumptions
• 3. Y is normally distributed for every outcome of X in the population
  – “Conditional normality”
• Ex: X = Years of Education, Y = Job Prestige
• Suppose we look only at a sub-sample: X = 12 years of education
  – Is a histogram of Job Prestige approximately normal?
  – What about for people with X = 4? X = 16?
• If all are roughly normal, the assumption is met

Bivariate Regression Assumptions
• Normality: Examine sub-samples at different values of X. Make histograms and check for normality.
  [Figures: histograms of HAPPY at different values of INCOME. One (Std. Dev = 1.51, Mean = 3.84, N = 60) looks roughly normal: good. Another (Std. Dev = 3.06, Mean = 4.58, N = 60) does not: not very good]

Bivariate Regression Assumptions
• 4. The variances of prediction errors are identical at every value of X
  – Recall: Error is the deviation from the regression line
  – Is the dispersion of error consistent across values of X?
  – Definition: “homoskedasticity” = error dispersion is consistent across values of X
  – Opposite: “heteroskedasticity” = error dispersion varies with X
• Test: Compare errors for X = 12 years of education with errors for X = 2, X = 8, etc.
  – Are the errors around the line similar? Or different? (A rough computational check is sketched at the end of this section)

Bivariate Regression Assumptions
• Homoskedasticity: Equal error variance
  [Scatter plot against INCOME: examine error at different values of X. Is it roughly equal? Here, things look pretty good]

Bivariate Regression Assumptions
• Heteroskedasticity: Unequal error variance
  [Scatter plot against INCOME: at higher values of X, error variance increases a lot. This looks pretty bad]

Bivariate Regression Assumptions
• Notes/Comments:
• 1. Overall, regression is robust to violations of assumptions
  – It often gives fairly reasonable results, even when assumptions aren’t perfectly met
• 2. Variations of OLS regression can handle situations where assumptions aren’t met
• 3. But, there are also further diagnostics to help ensure that results are meaningful…
  – We’ll discuss them next week.
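One informal way to probe assumptions 3 and 4 is to slice the prediction errors into bands of X and compare their spread (homoskedasticity) and shape (conditional normality) band by band. A rough sketch, assuming NumPy and SciPy; the data are simulated stand-ins, not the HAPPY/INCOME data pictured above:

```python
import numpy as np
from scipy import stats

# Simulated stand-in data: X like years of education, Y like prestige
rng = np.random.default_rng(0)
X = rng.integers(8, 21, size=300).astype(float)
Y = 9.4 + 2.5 * X + rng.normal(0, 12, size=300)

# Fit the regression line and compute the errors (residuals)
fit = stats.linregress(X, Y)
errors = Y - (fit.intercept + fit.slope * X)

# Examine the errors within bands of X
for lo, hi in [(8, 12), (12, 16), (16, 21)]:
    band = errors[(X >= lo) & (X < hi)]
    # Homoskedasticity: sd(error) should be similar across bands
    # Conditional normality: errors within each band should look normal
    stat, p_norm = stats.shapiro(band)
    print(f"X in [{lo}, {hi}): n = {len(band)}, "
          f"sd(error) = {band.std(ddof=1):.2f}, Shapiro-Wilk p = {p_norm:.3f}")
```

Comparable standard deviations across bands suggest homoskedasticity; sharply unequal ones suggest heteroskedasticity, like the second scatter plot above.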
Regression Hypothesis Tests
• If the assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
• The standard deviation of the sampling distribution is called the standard error of the slope (s_b)
• Population formula for the standard error:

  s_b = \sqrt{\frac{s_e^2}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}

• Where s_e^2 is the variance of the regression error

Regression Hypothesis Tests
• Estimating s_e^2 lets us estimate the standard error:

  \hat{s}_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N-2} = \frac{SS_{ERROR}}{N-2} = MS_{ERROR}

• Now we can estimate the S.E. of the slope:

  \hat{s}_b = \sqrt{\frac{MS_{ERROR}}{\sum_{i=1}^{N}(X_i - \bar{X})^2}}

Regression Hypothesis Tests
• Finally: A t-value can be calculated:
  – It is the slope divided by the standard error

  t_{N-2} = \frac{b_{YX}}{s_b} = \frac{b_{YX}}{\sqrt{\dfrac{MS_{ERROR}}{s_X^2 (N-1)}}}

• Where s_b is the sample point estimate of the standard error
• The t-value is based on N−2 degrees of freedom

Example: Education & Job Prestige
• t-values can be compared to the critical t...
  [SPSS Coefficients table as above: slope B = 2.487, Std. Error = .108, t = 23.102, Sig. = .000]
• SPSS estimates the standard error of the slope; this is used to calculate a t-value
• The t-value can be compared to the “critical value” to test hypotheses. Or, just compare “Sig.” to alpha
• If t > critical t, or Sig. < α, reject H0

Regression Confidence Intervals
• You can also use the standard error of the slope to estimate confidence intervals:

  C.I. = b \pm s_b (t_{N-2})

• Where t_{N−2} is the t-value for a two-tailed test given a desired α-level
• Example: Observed slope = 2.5, S.E. = .10
• The 95% t-value for 102 d.f. is approximately 2
• 95% C.I. = 2.5 ± 2(.10)
• Confidence Interval: 2.3 to 2.7 (these calculations are sketched in code below)
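The standard-error, t-value, and confidence-interval formulas above can be reproduced by hand. Here is a sketch, assuming NumPy/SciPy and the same invented data as in the earlier sketches:

```python
import numpy as np
from scipy import stats

# Invented data, as in the earlier sketches
X = np.array([8.0, 10, 12, 12, 14, 16, 16, 18, 20])
Y = np.array([25.0, 30, 38, 42, 45, 50, 55, 60, 68])
N = len(X)

# Slope and constant
b = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

# MS_ERROR = SS_ERROR / (N - 2) estimates the error variance
e = Y - (a + b * X)
ms_error = np.sum(e ** 2) / (N - 2)

# Estimated standard error of the slope
se_b = np.sqrt(ms_error / np.sum((X - X.mean()) ** 2))

# t-value with N - 2 degrees of freedom
t_value = b / se_b

# 95% confidence interval: b +/- s_b * t_{N-2}
t_crit = stats.t.ppf(0.975, df=N - 2)   # two-tailed critical t, alpha = .05
ci_low, ci_high = b - se_b * t_crit, b + se_b * t_crit

print(f"b = {b:.3f}, s_b = {se_b:.3f}, t = {t_value:.3f}")
print(f"95% C.I.: {ci_low:.3f} to {ci_high:.3f}")
```

With only 9 cases, the critical t here is larger than the "approximately 2" used for 102 d.f. in the slide example, so the interval is wider relative to the standard error.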
Regression Hypothesis Tests
• You can also use a t-test to determine if the constant (a) is significantly different from zero
  – But, this is typically less useful to do

  t_{N-2} = \frac{a_{YX}}{s_a}

• Where s_a is the estimated standard error of the constant, based on MS_{ERROR}
• Hypotheses (α = the population parameter corresponding to a):
• H0: α = 0; H1: α ≠ 0
• But, most research focuses on slopes

Regression: Outliers
• Note: Even if regression assumptions are met, slope estimates can have problems
• Example: Outliers – cases with extreme values that differ greatly from the rest of your sample
• Outliers can result from:
  – Errors in coding or data entry
  – Highly unusual cases
  – Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change estimates of the slope (b)

Regression: Outliers
• Outlier Example:
  [Scatter plot: an extreme case pulls the regression line up, compared to the regression line with the extreme case removed from the sample]

Regression: Outliers
• Strategy for dealing with outliers:
• 1. Identify them
• Look at scatterplots for extreme values
• Or, ask SPSS to compute outlier diagnostic statistics
  – There are several statistics to identify cases that are affecting the regression slope a lot
  – Examples: “Leverage”, Cook’s D, DFBETA
  – SPSS can even identify “problematic” cases for you… but it is preferable to do it yourself

Regression: Outliers
• 2. Depending on the circumstances, either:
• A) Drop cases from the sample and re-do the regression
  – Especially for coding errors and very extreme outliers
  – Or if there is a theoretical reason to drop cases
  – Example: In analyses of economic activity, communist countries differ a lot…
• B) Or, sometimes it is reasonable to leave outliers in the analysis
  – e.g., if there are several that represent an important minority group in your data
• When writing papers, identify whether outliers were excluded (and the effect that had on the analysis). (An identify-then-refit workflow is sketched below)
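As a sketch of the identify-then-refit workflow, here is Cook's D computed from its textbook formula for a two-parameter (bivariate) regression, assuming NumPy. The data are invented, with one deliberately extreme case appended; the D > 4/N cutoff is one common rule of thumb, not a rule from the lecture, and whether to drop flagged cases remains the judgment call discussed above:

```python
import numpy as np

def fit_line(X, Y):
    """Return (a, b) for the OLS line Y-hat = a + b*X."""
    b = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
    return Y.mean() - b * X.mean(), b

# Invented data with one extreme case appended at the end
X = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20, 9], dtype=float)
Y = np.array([25, 30, 38, 42, 45, 50, 55, 60, 68, 95], dtype=float)
N, p = len(X), 2   # p = number of estimated parameters (a and b)

a, b = fit_line(X, Y)
e = Y - (a + b * X)
mse = np.sum(e ** 2) / (N - p)

# Leverage (hat values) and Cook's D for each case
h = 1 / N + (X - X.mean()) ** 2 / np.sum((X - X.mean()) ** 2)
cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)

# Flag influential cases by a common rule of thumb, then re-fit without them
flagged = cooks_d > 4 / N
a2, b2 = fit_line(X[~flagged], Y[~flagged])

print(f"Flagged case indices: {np.where(flagged)[0]}")
print(f"Slope with all cases: {b:.3f}; without flagged cases: {b2:.3f}")
```

Comparing the two slopes makes the slide's point concrete: even a single extreme case can noticeably pull the regression line, and a paper should report whether (and why) such cases were excluded.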