Matakuliah: D0722 - Statistika dan Aplikasinya
Tahun: 2010

Regresi dan Korelasi (Regression and Correlation)
Pertemuan 10 (Session 10)

Learning Outcomes
• At the end of this session, students are expected to be able to:
1. relate two variables using simple linear regression and correlation analysis
2. show the relationship between variables based on the results of hypothesis tests

(The following slides are from Aczel/Sounderpandian, COMPLETE BUSINESS STATISTICS, 5th edition, McGraw-Hill/Irwin, © The McGraw-Hill Companies, Inc., 2002.)

The Simple Linear Regression Model

The population simple linear regression model:

    Y = β₀ + β₁X + ε

that is, a nonrandom or systematic component (β₀ + β₁X) plus a random component (ε), where:
• Y is the dependent variable, the variable we wish to explain or predict;
• X is the independent variable, also called the predictor variable;
• ε is the error term, the only random component in the model, and thus the only source of randomness in Y;
• β₀ is the intercept of the systematic component of the regression relationship;
• β₁ is the slope of the systematic component.

The conditional mean of Y:

    E[Y | X] = β₀ + β₁X

Picturing the Simple Linear Regression Model

[Regression plot: the line E[Y] = β₀ + β₁X, with intercept β₀, slope β₁, and an observed point Yᵢ lying off the line by the error εᵢ.]

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

    E[Yᵢ] = β₀ + β₁Xᵢ

Actual observed values of Y differ from the expected value by an unexplained or random error:

    Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ

10-3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
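Before turning to estimation, the population model can be illustrated by simulating data from it. This is a minimal sketch: the parameter values, sample size, and X range below are invented for demonstration (they merely echo the magnitudes seen later in Example 10-1) and are not from the slides.

```python
import numpy as np

# Illustrative simulation of the population model Y = beta0 + beta1*X + epsilon.
# All numbers here are invented for demonstration purposes.
rng = np.random.default_rng(0)

beta0, beta1 = 274.85, 1.2553            # systematic component
n = 25
x = rng.uniform(1500, 5500, size=n)      # predictor values
eps = rng.normal(0.0, 318.16, size=n)    # random component: the only source of randomness in Y
y = beta0 + beta1 * x + eps              # observed responses

# The conditional mean E[Y | X] is the systematic part alone:
conditional_mean = beta0 + beta1 * x
assert np.allclose(y - conditional_mean, eps)
```

Each observed Yᵢ differs from its conditional mean by exactly the error εᵢ, which is what the final assertion checks.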
The estimated regression equation:

    Y = b₀ + b₁X + e

where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.

The estimated regression line:

    Ŷ = b₀ + b₁X

where Ŷ ("Y-hat") is the value of Y lying on the fitted regression line for a given value of X.

Fitting a Regression Line

[Four panels: the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]

Errors in Regression

[Plot: the observed data point (Xᵢ, Yᵢ), the fitted regression line Ŷ = b₀ + b₁X, the predicted value Ŷᵢ for Xᵢ, and the error eᵢ = Yᵢ − Ŷᵢ.]

Least Squares Regression

The sum of squared errors in regression is:

    SSE = Σᵢ eᵢ² = Σᵢ (yᵢ − ŷᵢ)²    (summing over i = 1, …, n)

The least squares regression line is the one that minimizes the SSE with respect to the estimates b₀ and b₁.
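The minimization property can be sketched numerically. The five data points below are invented for illustration; the closed-form estimates used here are the ones derived in the next section.

```python
import numpy as np

# Minimal sketch: the least squares line minimizes SSE = sum((y - yhat)^2).
# The data points are invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def sse(b0, b1):
    """Sum of squared errors for the candidate line b0 + b1*x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

# Closed-form least squares estimates:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Perturbing either coefficient away from the least squares values
# can only increase the SSE:
assert sse(b0, b1) <= sse(b0 + 0.1, b1)
assert sse(b0, b1) <= sse(b0, b1 + 0.1)
```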
The normal equations:

    Σy  = n·b₀ + b₁·Σx
    Σxy = b₀·Σx + b₁·Σx²

[Plot of the SSE surface over (b₀, b₁): SSE is minimized with respect to b₀ and b₁ at the least squares values.]

Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

    SS_X  = Σ(x − x̄)²        = Σx² − (Σx)²/n
    SS_Y  = Σ(y − ȳ)²        = Σy² − (Σy)²/n
    SS_XY = Σ(x − x̄)(y − ȳ)  = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

    b₁ = SS_XY / SS_X
    b₀ = ȳ − b₁x̄

Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression: df = n − 2 (n total observations, less one degree of freedom for each parameter estimated, b₀ and b₁).

Square and sum all regression errors to find SSE:

    SSE = Σ(Y − Ŷ)² = SS_Y − (SS_XY)²/SS_X = SS_Y − b₁·SS_XY

An unbiased estimator of σ², denoted by s², is the mean square error:

    MSE = SSE / (n − 2),    s = √MSE

Example 10-1:

    SSE = SS_Y − b₁·SS_XY = 66855898 − (1.255333776)(51402852.4) = 2328161.2
    MSE = 2328161.2 / 23 = 101224.4
    s = √101224.4 = 318.158

Standard Errors of Estimates in Regression

The standard error of b₀ (intercept):

    s(b₀) = s·√( Σx² / (n·SS_X) ),    where s = √MSE

The standard error of b₁ (slope):

    s(b₁) = s / √SS_X

Example 10-1:

    s(b₀) = 318.158·√( 293426944 / ((25)(40947557.84)) ) = 170.338
    s(b₁) = 318.158 / √40947557.84 = 0.04972

Confidence Intervals for the Regression Parameters

A (1 − α)·100% confidence interval for β₀:

    b₀ ± t_(α/2, n−2)·s(b₀)

A (1 − α)·100% confidence interval for β₁:

    b₁ ± t_(α/2, n−2)·s(b₁)

Least-squares point estimate: b₁ = 1.25533
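The estimator, standard-error, and confidence-interval formulas above can be checked against the summary statistics quoted for Example 10-1 (the raw data are not shown in this deck, so the sketch works entirely from n, SS_X, SS_Y, SS_XY, Σx², the quoted b₀, and the quoted critical value):

```python
import math

# Sketch reproducing the Example 10-1 computations from the summary
# statistics quoted on the slides; the raw data are not available here.
n = 25
SS_X = 40947557.84
SS_Y = 66855898.0
SS_XY = 51402852.4
sum_x_sq = 293426944.0        # sum of x^2, needed for s(b0)

b1 = SS_XY / SS_X             # least squares slope, about 1.25533
SSE = SS_Y - b1 * SS_XY       # error sum of squares
MSE = SSE / (n - 2)           # unbiased estimator of sigma^2
s = math.sqrt(MSE)            # standard error of the regression

s_b1 = s / math.sqrt(SS_X)                     # standard error of the slope
s_b0 = s * math.sqrt(sum_x_sq / (n * SS_X))    # standard error of the intercept

# 95% confidence intervals with t(0.025, 23) = 2.069; b0 = 274.85 as quoted:
t_crit = 2.069
b0 = 274.85
ci_b0 = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

# Zero lies inside the interval for beta0 but outside the interval for beta1:
assert ci_b0[0] < 0 < ci_b0[1]
assert ci_b1[0] > 0
```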
Example 10-1, 95% confidence intervals:

    b₀ ± t_(0.025, 23)·s(b₀) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]
    b₁ ± t_(0.025, 23)·s(b₁) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]

Since the interval for the slope excludes zero, 0 is not a possible value of the regression slope at 95%.

[Plot: the slope interval pictured as height = slope over length = 1.]

Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1:

    ρ = −1       indicates a perfect negative linear relationship
    −1 < ρ < 0   indicates a negative linear relationship
    ρ = 0        indicates no linear relationship
    0 < ρ < 1    indicates a positive linear relationship
    ρ = 1        indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.

Illustrations of Correlation

[Five scatter plots illustrating ρ = −1, ρ = −0.8, ρ = 0, ρ = 0.8, and ρ = 1.]

Covariance and Correlation

The covariance of two random variables X and Y:

    Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

where μ_X and μ_Y are the population means of X and Y, respectively.
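The covariance-based definition of correlation and the sums-of-squares form used on these slides agree on sample data, which can be verified numerically. The data points below are invented for illustration:

```python
import numpy as np

# Sketch relating the correlation definition to the sums-of-squares form
# r = SS_XY / sqrt(SS_X * SS_Y). The data are invented for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.4, 10.9])

SS_X = np.sum((x - x.mean()) ** 2)
SS_Y = np.sum((y - y.mean()) ** 2)
SS_XY = np.sum((x - x.mean()) * (y - y.mean()))

r = SS_XY / np.sqrt(SS_X * SS_Y)   # sample correlation coefficient

# The same value via numpy's built-in correlation matrix:
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
assert -1.0 <= r <= 1.0
```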
The population correlation coefficient:

    ρ = Cov(X, Y) / (σ_X·σ_Y)

The sample correlation coefficient:

    r = SS_XY / √(SS_X·SS_Y)

Note: if ρ < 0 then β₁ < 0; if ρ = 0 then β₁ = 0; if ρ > 0 then β₁ > 0.

Example 10-1:

    r = 51402852.4 / √((40947557.84)(66855898)) = 51402852.4 / 52321943.29 = 0.9824

Hypothesis Tests for the Correlation Coefficient

    H₀: ρ = 0   (no linear relationship)
    H₁: ρ ≠ 0   (some linear relationship)

Test statistic:

    t_(n−2) = r / √( (1 − r²) / (n − 2) )

Example 10-1:

    t = 0.9824 / √( (1 − 0.9651) / (25 − 2) ) = 0.9824 / 0.0389 = 25.25
    t_(0.005) = 2.807 < 25.25, so H₀ is rejected at the 1% level.

Hypothesis Tests about the Regression Relationship

[Three panels in which no linear relationship would be found or the linear model would fail: constant Y, unsystematic variation, and a nonlinear relationship.]

A hypothesis test for the existence of a linear relationship between X and Y:

    H₀: β₁ = 0
    H₁: β₁ ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

    t_(n−2) = b₁ / s(b₁)

where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Hypothesis Tests for the Regression Slope

Example 10-1:

    H₀: β₁ = 0, H₁: β₁ ≠ 0
    t = b₁ / s(b₁) = 1.25533 / 0.04972 = 25.25
    t_(0.005, 23) = 2.807 < 25.25

H₀ is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:

    H₀: β₁ = 1, H₁: β₁ ≠ 1
    t = (b₁ − 1) / s(b₁) = (1.24 − 1) / 0.21 = 1.14
    t_(0.05, 58) = 1.671 > 1.14

H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
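The correlation t test from Example 10-1 can be sketched directly from the quoted r and n, using the critical value t(0.005, 23) = 2.807 given on the slide:

```python
import math

# Sketch of the correlation t test from Example 10-1 (r = 0.9824, n = 25).
r, n = 0.9824, 25

# Test statistic with n - 2 degrees of freedom:
t_stat = r / math.sqrt((1 - r**2) / (n - 2))

# Compare against the two-sided 1% critical value quoted on the slide:
t_crit = 2.807   # t(0.005, 23)
reject_at_1pct = abs(t_stat) > t_crit
assert reject_at_1pct
```

Minor rounding aside (the slide rounds 1 − r² to 0.0349 before dividing), the statistic comes out near 25.25, far beyond the critical value, so the null of no linear relationship is rejected.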
How Good is the Regression?

The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data. Each deviation of y from its mean decomposes as:

    (y − ȳ) = (y − ŷ) + (ŷ − ȳ)
    Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all observations gives the sums of squares:

    Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
    SST = SSE + SSR

    r² = SSR/SST = 1 − SSE/SST

r² is the percentage of the total variation explained by the regression.

The Coefficient of Determination

[Three panels illustrating r² = 0, r² = 0.50, and r² = 0.90: as SSE shrinks relative to SST, a larger share of the variation is explained by the regression.]

Example 10-1:

    r² = SSR/SST = 64527736.8 / 66855898 = 0.96518

[Scatter plot of Dollars (1000 to 7000) against Miles (1500 to 5500) with the fitted regression line.]

Analysis of Variance and an F Test of the Regression Model

    Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
    Regression            SSR              1                    MSR           MSR/MSE
    Error                 SSE              n − 2                MSE
    Total                 SST              n − 1                MST

Example 10-1:

    Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p-Value
    Regression            64527736.8       1                    64527736.8    637.47    0.000
    Error                 2328161.2        23                   101224.4
    Total                 66855898.0       24

Use of the Regression Model for Prediction

• Point Prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
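The ANOVA table entries for Example 10-1 follow mechanically from SST, SSR, and n, as this sketch shows (it uses only the sums of squares quoted on the slides):

```python
# Sketch of the ANOVA decomposition for Example 10-1: SST = SSR + SSE,
# r^2 = SSR/SST, and F = MSR/MSE with (1, n - 2) degrees of freedom.
SST = 66855898.0
SSR = 64527736.8
SSE = SST - SSR          # about 2328161.2
n = 25

r_squared = SSR / SST    # fraction of total variation explained
MSR = SSR / 1            # regression mean square (1 df)
MSE = SSE / (n - 2)      # error mean square (n - 2 df)
F = MSR / MSE

assert abs(r_squared - 0.96518) < 1e-4
assert abs(F - 637.47) < 0.5
```

Note that for simple regression the F statistic is the square of the slope t statistic (25.25² ≈ 637.5), so the two tests are equivalent.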
• Prediction Interval
  • For a single value of Y given a value of X, the interval reflects both:
    • variation in the regression line estimate
    • variation of points around the regression line
  • For the average value of Y given a value of X, the interval reflects only:
    • variation in the regression line estimate

Prediction Interval for a Value of Y

A (1 − α)·100% prediction interval for Y:

    ŷ ± t_(α/2, n−2)·s·√( 1 + 1/n + (x − x̄)²/SS_X )

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√( 1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84 )
    = 5,296.05 ± 676.62 = [4,619.43, 5,972.67]

Prediction Interval for the Average Value of Y

A (1 − α)·100% confidence interval for E[Y | X]:

    ŷ ± t_(α/2, n−2)·s·√( 1/n + (x − x̄)²/SS_X )

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√( 1/25 + (4,000 − 3,177.92)²/40,947,557.84 )
    = 5,296.05 ± 156.48 = [5,139.57, 5,452.53]

RINGKASAN (Summary)

Regression:
• the form of the relationship between the independent variable and the dependent variable

Correlation:
• the strength and direction of the relationship between two variables
• hypothesis tests on the regression parameters
• hypothesis tests on the correlation
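To close, the two Example 10-1 intervals at X = 4,000 can be recomputed from the summary statistics quoted on the slides. The sketch makes the key structural point explicit: the prediction interval for a single Y carries an extra "1 +" term inside the square root, so it is always wider than the interval for the mean response.

```python
import math

# Sketch of the Example 10-1 intervals at X = 4,000, using the summary
# statistics quoted on the slides (the raw data are not shown in this deck).
n = 25
x_bar = 3177.92
SS_X = 40947557.84
b0, b1 = 274.85, 1.2553
s = 318.16
t_crit = 2.069   # t(0.025, 23)

x0 = 4000.0
y_hat = b0 + b1 * x0   # point prediction, about 5296.05

# Prediction interval half-width for a single new Y at x0
# (the "1 +" term adds the scatter of individual points around the line):
half_pred = t_crit * s * math.sqrt(1 + 1/n + (x0 - x_bar)**2 / SS_X)

# Confidence interval half-width for the mean E[Y | X = x0]
# (regression line uncertainty only):
half_mean = t_crit * s * math.sqrt(1/n + (x0 - x_bar)**2 / SS_X)

# The prediction interval is necessarily wider than the interval for the mean:
assert half_pred > half_mean
```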