Lecture on Correlation and Regression Analyses

REVIEW – VARIABLE
A variable is a characteristic that changes or varies over time or across the different individuals or objects under consideration.

Broad classification of variables:
- Qualitative
- Quantitative
  - Discrete
  - Continuous

TYPES OF VARIABLE
Qualitative
- assumes values that are not numerical but can be categorized
- categories may be identified by either nonnumerical descriptions or by numeric codes

Quantitative
- indicates the quantity or amount of a characteristic
- data are always numeric
- can be discrete or continuous

TYPES OF QUANTITATIVE VARIABLES
- Discrete – a variable with a finite or countable number of possible values
- Continuous – a variable that assumes any value in a given interval

LEVELS/SCALES OF MEASUREMENT
Data may be classified into four hierarchical levels of measurement: nominal, ordinal, interval, and ratio.
Note: The type of statistical analysis that is appropriate for a particular variable depends on its level of measurement.

NOMINAL SCALE
- Data collected are labels, names, or categories.
- Frequencies or counts of observations belonging to the same category can be obtained.
- It is the lowest level of measurement.

ORDINAL SCALE
- Data collected are labels with an implied ordering.
- The difference between two data labels is meaningless.

INTERVAL SCALE
- Data can be ordered or ranked.
- The difference between two data values is meaningful.
- Data at this level may lack an absolute zero point.

RATIO SCALE
- Data have all the properties of the interval scale.
- The number zero indicates the absence of the characteristic being measured.
- It is the highest level of measurement.

LEARNING POINTS – PART II
1. What is a correlation analysis?
2. What is a regression analysis?
3. When do we use correlation analysis?
4. When do we use regression analysis?
5. How do we compare regression versus correlation analysis?

CORRELATION ANALYSIS
A statistical technique used to determine the strength of the relationship between two variables, X and Y. It provides a measure of the strength of the linear relationship between two variables measured in at least the interval scale.

ILLUSTRATION
The UP Admissions Office may be interested in the relationship between the UPCAT scores in Math and in Reading Comprehension of UPCAT qualifiers.

ILLUSTRATION
A social scientist might be concerned with how a city's crime rate is related to its unemployment rate.

ILLUSTRATION
A nutritionist might try to relate the quantity of carbohydrates consumed in the diet to the amount of sugar in the blood of diabetic individuals.
PEARSON'S CORRELATION COEFFICIENT, \rho

\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}

where
\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y}) = covariance between X and Y
\sigma_X = standard deviation of the X values
\sigma_Y = standard deviation of the Y values
N = number of paired observations in the population

Sign of \rho:
- \rho > 0: X and Y increase (decrease) together.
- \rho < 0: X increases (decreases) while Y decreases (increases).
- \rho = 0: no pattern; X and Y have no linear relationship.

SAMPLE CORRELATION COEFFICIENT, r

r = \frac{s_{XY}}{s_X s_Y},   -1 \le r \le 1

where
s_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = \frac{\sum_{i=1}^{n} X_i Y_i - (\sum_{i=1}^{n} X_i)(\sum_{i=1}^{n} Y_i)/n}{n-1} = sample covariance of the X and Y values
s_X = sample standard deviation of the X values
s_Y = sample standard deviation of the Y values
n = sample size

QUALITATIVE INTERPRETATION OF \rho AND r

Absolute value of the correlation coefficient    Strength of linear relationship
0.0 – 0.2                                        Very weak
0.2 – 0.4                                        Weak
0.4 – 0.6                                        Moderate
0.6 – 0.8                                        Strong
0.8 – 1.0                                        Very strong

EXAMPLE
It is of interest to study the relationship between the number of hours spent studying and the student's grade in an examination. A random sample of twenty students is selected and the data are given in the following table. Compute and interpret the sample correlation coefficient.

Student  Hours Studied  Score (%)    Student  Hours Studied  Score (%)
1        3              71           11       4              80
2        5              90           12       3              60
3        2              83           13       1              63
4        3              70           14       0              49
5        4              93           15       3              80
6        2              50           16       1              61
7        3              70           17       1              63
8        4              90           18       2              50
9        3              76           19       3              60
10       4              80           20       1              65

SCATTER PLOT
[Figure: scatter plot of examination score (40–100) against the number of hours spent studying (0–6).]

Computations:

s_X^2 = \frac{\sum X_i^2 - (\sum X_i)^2/n}{n-1} = 1.7263, so s_X = 1.3139; likewise s_Y = 13.5320

s_{XY} = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{n-1} = \frac{3901 - (52)(1404)/20}{19} = 13.1895

Sample quantities: \sum X_i = 52, \sum Y_i = 1404, \sum X_i Y_i = 3901, s_X = 1.3139, s_Y = 13.5320, s_X^2 = 1.7263, s_Y^2 = 183.1158, s_{XY} = 13.1895

r = \frac{s_{XY}}{s_X s_Y} = \frac{13.1895}{(1.3139)(13.5320)} = 0.7418

Interpretation: There is a strong positive linear relationship between the number of hours the students spent studying for the exam and their exam scores.

TEST OF HYPOTHESIS ABOUT \rho
Ho: \rho = 0; there is no linear relationship between X and Y.
vs. Ha: \rho \ne 0; there is a linear relationship between X and Y.
or Ha: \rho > 0; there is a positive linear relationship between X and Y.
or Ha: \rho < 0; there is a negative linear relationship between X and Y.

The standardized form of the test statistic is

t_c = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

which follows the Student's t distribution with n - 2 df when the null hypothesis is true. This is commonly referred to as the t-test for the correlation coefficient.

With a given level of significance \alpha:

Alternative Hypothesis             Decision Rule
Ha: \rho \ne 0 (two-tailed test)   Reject Ho if |t_c| > t_{\alpha/2}(n-2); fail to reject Ho otherwise.
Ha: \rho > 0 (one-tailed test)     Reject Ho if t_c > t_{\alpha}(n-2); fail to reject Ho otherwise.
Ha: \rho < 0 (one-tailed test)     Reject Ho if t_c < -t_{\alpha}(n-2); fail to reject Ho otherwise.
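The calculations above can be checked mechanically. The following is a minimal Python sketch, not part of the original lecture, that reproduces r and the t-statistic from the 20-student data using the formulas given above; the variable names are illustrative only.

```python
import numpy as np

# Hours studied (X) and exam scores (Y) for the 20 sampled students (from the example above)
hours = np.array([3, 5, 2, 3, 4, 2, 3, 4, 3, 4, 4, 3, 1, 0, 3, 1, 1, 2, 3, 1])
score = np.array([71, 90, 83, 70, 93, 50, 70, 90, 76, 80,
                  80, 60, 63, 49, 80, 61, 63, 50, 60, 65])
n = len(hours)

# Sample covariance and standard deviations (the n-1 divisor used in the slides)
s_xy = np.cov(hours, score, ddof=1)[0, 1]
s_x = hours.std(ddof=1)
s_y = score.std(ddof=1)

r = s_xy / (s_x * s_y)                        # sample correlation coefficient
t_c = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # t-test statistic for Ho: rho = 0

print(f"s_xy = {s_xy:.4f}, s_x = {s_x:.4f}, s_y = {s_y:.4f}")
print(f"r = {r:.4f}, t_c = {t_c:.4f}")        # expected: r ~ 0.7418, t_c ~ 4.69
```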
EXAMPLE
In the study of the relationship between the number of hours spent studying and the student's grade in an examination, is there evidence to say that a longer time spent studying is associated with higher exam scores at the 5% level of significance?

Test of Hypothesis
Ho: \rho = 0; there is no linear relationship between the number of hours a student spent studying for the exam and his exam score.
Ha: \rho > 0; there is a positive linear relationship between the number of hours a student spent studying for the exam and his exam score.

The test statistic is

t_c = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} ~ Student's t with n - 2 df

Test procedure: one-tailed t-test for the correlation coefficient
Decision rule: Reject Ho if t_c > t_{0.05}(18) = 1.734; fail to reject Ho otherwise.

Computation:

t_c = \frac{0.7418\sqrt{20-2}}{\sqrt{1-0.7418^2}} = 4.6929

Decision: Reject Ho.
Conclusion: At \alpha = 5%, there is evidence to say that a longer time spent studying is associated with higher exam scores.

WORD OF CAUTION
Correlation is a measure of the strength of the linear relationship between two variables, with no suggestion of "cause and effect" or a causal relationship. A correlation coefficient equal to zero only indicates a lack of linear relationship and does not discount the possibility that other forms of relationship may exist.

REGRESSION ANALYSIS
A statistical technique used to study the functional relationship between variables, which allows predicting the value of one variable, say Y, given the value of another variable, say X.

Y – dependent variable: a variable whose variation/value depends on that of another.
X – independent variable: a variable whose variation/value does not depend on that of another.

ILLUSTRATION
The relationship between the number of hours spent studying and the student's exam score may be expressed in equation form. This equation may be used to predict the student's exam score knowing the number of hours the student spent studying.

ILLUSTRATION
A child's height is studied to see whether it is related to his father's height, such that some equation can be used to predict a child's height given his father's height. Sales of a product may be related to the corresponding advertising expenditures.

SAMPLE REGRESSION MODEL

\hat{Y}_i = b_0 + b_1 X_i

where
b_0 = estimated Y-intercept; the predicted value of Y when X = 0
b_1 = estimated slope of the line; measures the change in the predicted value of Y per unit change in X

ESTIMATORS

b_1 = \frac{s_{XY}}{s_X^2},   b_0 = \bar{Y} - b_1\bar{X}

where \bar{Y} = mean of the Y values and \bar{X} = mean of the X values, and

s_{Y|X}^2 = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2} = \frac{(n-1)(s_Y^2 - b_1 s_{XY})}{n-2}

is the estimated common variance of the Y's.

EXAMPLE
In the previous example, we may want to predict the examination score of a student given the number of hours he spent studying.

b_1 = \frac{s_{XY}}{s_X^2} = \frac{13.1895}{1.7263} = 7.6403
b_0 = \bar{Y} - b_1\bar{X} = 70.2 - (7.6403)(2.6) = 50.3352

Estimated regression line: \hat{Y}_i = 50.3352 + 7.6403 X_i
The predicted exam score for X_i = 2.5 is 69.44, or about 69.

TEST OF HYPOTHESIS ABOUT b_1
Ho: b_1 = b_1^*
vs. Ha: b_1 \ne b_1^*, or Ha: b_1 > b_1^*, or Ha: b_1 < b_1^*
where b_1^* is the hypothesized value of b_1.

The standardized form of the test statistic is

t_c = \frac{b_1 - b_1^*}{se(b_1)},  where  se(b_1) = \frac{s_{Y|X}}{s_X\sqrt{n-1}}

and it follows the Student's t distribution with n - 2 df when the null hypothesis is true. This is commonly referred to as the t-test for the regression coefficient.

With a given level of significance \alpha:

Alternative Hypothesis                Decision Rule
Ha: b_1 \ne b_1^* (two-tailed test)   Reject Ho if |t_c| > t_{\alpha/2}(n-2); fail to reject Ho otherwise.
Ha: b_1 > b_1^* (one-tailed test)     Reject Ho if t_c > t_{\alpha}(n-2); fail to reject Ho otherwise.
Ha: b_1 < b_1^* (one-tailed test)     Reject Ho if t_c < -t_{\alpha}(n-2); fail to reject Ho otherwise.
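As a check on the estimation formulas, here is a minimal Python sketch, not from the lecture, that reproduces b_1, b_0, the predicted score at 2.5 hours, and se(b_1) for the same data; the variable names are illustrative only.

```python
import numpy as np

# Same 20-student sample as above: hours studied (X) and exam score (Y)
hours = np.array([3, 5, 2, 3, 4, 2, 3, 4, 3, 4, 4, 3, 1, 0, 3, 1, 1, 2, 3, 1])
score = np.array([71, 90, 83, 70, 93, 50, 70, 90, 76, 80,
                  80, 60, 63, 49, 80, 61, 63, 50, 60, 65])
n = len(hours)

s_xy = np.cov(hours, score, ddof=1)[0, 1]   # sample covariance
s_x2 = hours.var(ddof=1)                    # sample variance of X

b1 = s_xy / s_x2                            # estimated slope
b0 = score.mean() - b1 * hours.mean()       # estimated intercept

# Estimated common variance of the Y's and standard error of b1
s_y2 = score.var(ddof=1)
s2_yx = (n - 1) * (s_y2 - b1 * s_xy) / (n - 2)
se_b1 = np.sqrt(s2_yx) / (hours.std(ddof=1) * np.sqrt(n - 1))

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")                      # expected ~ 7.6403 and ~ 50.3352
print(f"predicted score at 2.5 hours: {b0 + b1 * 2.5:.2f}")  # expected ~ 69.44
print(f"se(b1) = {se_b1:.4f}")                               # expected ~ 1.6279
```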
EXAMPLE
Using the previous example, test at \alpha = 5% whether a student's examination score increases by at least 1 percentage point with an additional hour of study time.

Ho: b_1 \le 1
Ha: b_1 > 1

Test statistic: t_c = \frac{b_1 - b_1^*}{se(b_1)} ~ Student's t with n - 2 df
Test procedure: one-tailed t-test for the regression coefficient
Decision rule: Reject Ho if t_c > t_{0.05}(18) = 1.734; fail to reject Ho otherwise.

Computations:

s_{Y|X} = \sqrt{\frac{(n-1)(s_Y^2 - b_1 s_{XY})}{n-2}} = \sqrt{\frac{(20-1)(183.1158 - 7.6403 \times 13.1895)}{20-2}} = 9.3230

se(b_1) = \frac{s_{Y|X}}{s_X\sqrt{n-1}} = \frac{9.3230}{1.3139\sqrt{20-1}} = 1.6279

t_c = \frac{b_1 - b_1^*}{se(b_1)} = \frac{7.6403 - 1}{1.6279} = 4.0791

Decision: Since t_c = 4.0791 > t_{0.05}(18) = 1.734, we reject Ho.
Conclusion: At \alpha = 5%, there is evidence that a student's exam score increases by at least 1 percentage point for each additional hour of study time.

TEST OF HYPOTHESIS ABOUT b_0
Ho: b_0 = b_0^*
vs. Ha: b_0 \ne b_0^*, or Ha: b_0 > b_0^*, or Ha: b_0 < b_0^*
where b_0^* is the hypothesized value of b_0.

The standardized form of the test statistic is

t_c = \frac{b_0 - b_0^*}{se(b_0)},  where  se(b_0) = \frac{s_{Y|X}}{s_X}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n(n-1)}}

and it follows the Student's t distribution with n - 2 df when the null hypothesis is true. This is commonly referred to as the t-test for the regression constant.

With a given level of significance \alpha:

Alternative Hypothesis                Decision Rule
Ha: b_0 \ne b_0^* (two-tailed test)   Reject Ho if |t_c| > t_{\alpha/2}(n-2); fail to reject Ho otherwise.
Ha: b_0 > b_0^* (one-tailed test)     Reject Ho if t_c > t_{\alpha}(n-2); fail to reject Ho otherwise.
Ha: b_0 < b_0^* (one-tailed test)     Reject Ho if t_c < -t_{\alpha}(n-2); fail to reject Ho otherwise.

EXAMPLE
At \alpha = 5%, test whether the data indicate that a student will fail (a score less than 60) if he did not study.

Ho: b_0 \ge 60
Ha: b_0 < 60

Test statistic: t_c = \frac{b_0 - b_0^*}{se(b_0)} ~ Student's t with n - 2 df
Test procedure: one-tailed t-test for the regression constant
Decision rule: Reject Ho if t_c < -t_{0.05}(18) = -1.734; fail to reject Ho otherwise.

Computations:

se(b_0) = \frac{s_{Y|X}}{s_X}\sqrt{\frac{\sum X_i^2}{n(n-1)}} = \frac{9.3230}{1.3139}\sqrt{\frac{168}{20(20-1)}} = 4.7180

t_c = \frac{b_0 - b_0^*}{se(b_0)} = \frac{50.3352 - 60}{4.7180} = -2.0485

Decision: Since t_c = -2.0485 < -t_{0.05}(18) = -1.734, we reject Ho.
Conclusion: At \alpha = 5%, the data indicate that a student who did not study for the examination will get a score less than 60; that is, the student will fail.

ADEQUACY OF THE MODEL
Coefficient of Determination (R^2): the proportion of the total variation in Y that is explained by X, usually expressed in percent.

R^2 = r^2 \times 100\% = \frac{s_{XY}^2}{s_X^2 s_Y^2} \times 100\% = b_1\frac{s_{XY}}{s_Y^2} \times 100\%

EXAMPLE

R^2 = b_1\frac{s_{XY}}{s_Y^2} \times 100\% = 7.6403 \times \frac{13.1895}{183.1158} \times 100\% = 55.03\%

Interpretation: Around 55% of the total variation in examination scores is explained by the number of hours spent studying. The remaining 45% is explained by other variables not in the model, or by the fact that the relationship is not exactly linear.
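To tie the two t-tests and the coefficient of determination together, here is a minimal Python sketch, not part of the lecture, that reproduces t_c for the slope and intercept tests and R^2 from the same data; scipy is used only to obtain the tabled critical value, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

hours = np.array([3, 5, 2, 3, 4, 2, 3, 4, 3, 4, 4, 3, 1, 0, 3, 1, 1, 2, 3, 1])
score = np.array([71, 90, 83, 70, 93, 50, 70, 90, 76, 80,
                  80, 60, 63, 49, 80, 61, 63, 50, 60, 65])
n = len(hours)

s_xy = np.cov(hours, score, ddof=1)[0, 1]
s_x, s_y = hours.std(ddof=1), score.std(ddof=1)
b1 = s_xy / s_x**2
b0 = score.mean() - b1 * hours.mean()

# Estimated common variance of the Y's and standard errors of b1 and b0
s2_yx = (n - 1) * (s_y**2 - b1 * s_xy) / (n - 2)
se_b1 = np.sqrt(s2_yx) / (s_x * np.sqrt(n - 1))
se_b0 = (np.sqrt(s2_yx) / s_x) * np.sqrt((hours**2).sum() / (n * (n - 1)))

t_slope = (b1 - 1) / se_b1        # Ho: b1 <= 1  vs  Ha: b1 > 1
t_intercept = (b0 - 60) / se_b0   # Ho: b0 >= 60 vs  Ha: b0 < 60
t_crit = stats.t.ppf(0.95, df=n - 2)  # one-tailed critical value, alpha = 0.05

r2 = (s_xy**2 / (s_x**2 * s_y**2)) * 100  # coefficient of determination, in percent

print(f"t(slope) = {t_slope:.4f}, t(intercept) = {t_intercept:.4f}, t_crit = {t_crit:.3f}")
print(f"R^2 = {r2:.2f}%")   # expected ~ 55.03%
```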
SUMMARY
1. Correlation analysis
2. Regression analysis
3. Application with computer output
4. Interpretation

Regression analysis expresses a causal relationship, where you can predict the value of one variable given the values of the other variable(s). Correlation analysis measures the relationship between two variables but without the causality clause. Regression analysis in policy analysis is usually used to forecast certain events; for example, our trend line is an example of a regression analysis.

Illustrations:
Knowing the effect of TV spot advertising on the number of people visiting the Family Planning clinic would allow the population commission official to decide rationally whether or not to increase the amount to be spent on TV spot advertising. The officer would be able to predict how many people the commission would be able to attract to the Family Planning clinic if it increased the number of TV ads run. (See series p. 176)

The relationship between two variables (in our example, the number of TV ad runs and the number of people visiting the Family Planning clinic) can be summarized by a line, called the regression line. This is the line that we will use to predict the value of one variable, given the other.

Formula of the regression line:

Y = a + bX + e

where b = the slope of the line, a = the Y-intercept or the value of Y when X = 0, and e = the error term.

Example: Relationship between TV ads and the number of people visiting the family planning clinic:

Municipality   Number of TV ads (X)   Number of people visiting the clinic (Y)
1              7                      42
2              5                      32
3              1                      10
4              8                      40
5              10                     61
6              2                      8
7              6                      35
8              7                      39
9              8                      48
10             9                      51
11             5                      30
12             7                      45
13             8                      41
14             2                      7
15             6                      37
16             5                      33

b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2},   a = \frac{\sum Y - b\sum X}{N}

With N = 16, \sum X = 96, \sum Y = 559, \sum XY = 3930, and \sum X^2 = 676:

b = \frac{16(3930) - (96)(559)}{16(676) - 96^2} = \frac{9216}{1600} = 5.76

a = \frac{559 - 5.76(96)}{16} = 0.4

The equation of the line is \hat{Y} = 0.4 + 5.76X.
If X = 5, the predicted value of Y is \hat{Y} = 0.4 + 5.76(5) = 29.2.
If X = 7, the predicted value of Y is \hat{Y} = 0.4 + 5.76(7) = 40.7.

Interpretation: An increase of one in the number of TV ad runs will generate an increase of 5.76 in the number of people visiting the family planning clinic. The family planning officer can now proceed with evaluating the cost-effectiveness of the program's ads.

Coefficient of Determination
The coefficient of determination is the percent variation in Y explained or accounted for by the variability of X. It is derived by squaring R and multiplying by 100, and it is expressed in percentage terms. Thus, if R = 0.9, the coefficient of determination will be 81%.

Formula:

R = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left(N\sum X^2 - (\sum X)^2\right)\left(N\sum Y^2 - (\sum Y)^2\right)}}

Hypothesis Testing for a and b
We use the t-statistic to test the hypothesis that a and b are significantly different from zero.

Excel analysis of the problem (Summary Output; revised figure based on 15 observations):

Regression Statistics
Multiple R          0.972499
R Square            0.945755
Adjusted R Square   0.941582
Standard Error      3.796237
Observations        15

ANOVA
Source        df    SS         MS         F          Significance F
Regression    1     3266.385   3266.385   226.6526   1.32E-09
Residual      13    187.3484   14.41141
Total         14    3453.733
(df: k, n-(k+1), n-1)

                Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept       0.373989       2.467574         0.151562   0.88186    -4.95688    5.704857
TV ads (X)      5.745957       0.381665         15.05499   1.32E-09   4.921421    6.570493

DUMMY VARIABLE
Represents a nominal or categorical variable in the regression model.
For example: Y = b_0 + b_1 X_1 + b_2 X_2, where Y = score, X_1 = hours spent studying, and X_2 = sex, taking a value of 1 if male and 0 otherwise.
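As an illustration of how such a dummy variable enters the model, here is a minimal Python sketch, not part of the lecture; the data below are made up purely for demonstration, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical illustration of the dummy-variable model Y = b0 + b1*X1 + b2*X2:
# X1 = hours spent studying, X2 = 1 if male and 0 otherwise (data are made up)
hours = np.array([1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 3, 4])
male  = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1])
score = np.array([55, 62, 60, 70, 68, 78, 75, 85, 52, 58, 69, 76])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(hours), hours, male])

# Ordinary least squares fit: solves for (b0, b1, b2)
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
b0, b1, b2 = coef

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
# b2 is read as the average difference in score between males and females
# after accounting for the number of hours spent studying.
```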