AMS 572 Group Project
Simple Linear Regression & Correlation
Instructor: Prof. Wei Zhu
11/21/2013

Outline
1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression – Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression – Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai

1. Motivation
Fig. 1.1 Simplified model for the solar system
http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/
Fig. 1.2 Obama & Romney during the presidential election campaign
http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-incampaign-history/

Introduction
• Regression Analysis
  Linear regression:
  - Simple Linear Regression: {y, x}
  - Multiple Linear Regression: {y; x1, …, xp}
  - Multivariate Linear Regression: {y1, …, yn; x1, …, xp}
• Correlation Analysis
  Pearson product-moment correlation coefficient: a measure of the linear relationship between two variables.

History
• Adrien-Marie Legendre: earliest form of regression, the least squares method.
• Carl Friedrich Gauss: further development of least squares theory, including the Gauss–Markov theorem.
• Sir Francis Galton: coined the term "regression".
• George Udny Yule & Karl Pearson: extension of regression to a more generalized statistical context.
http://en.wikipedia.org/wiki/Regression_analysis
http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
http://en.wikipedia.org/wiki/Francis_Galton
http://www.york.ac.uk/depts/maths/histstat/people/yule.gif
http://en.wikipedia.org/wiki/Karl_Pearson

2. A Probabilistic Model
Simple linear regression
- A special case of linear regression
- One response variable regressed on one explanatory variable
General setting
- We denote the explanatory variable by x_i and the response variable by y_i
- n pairs of observations {x_i, y_i}, i = 1 to n

2. A Probabilistic Model
Sketch the graph. Example data (n = 100):

  i      1       2       3        4     …    98      99      100
  X    37.70   16.31   28.37   -12.13   …   9.06   28.54   -17.19
  Y     9.82    5.00    9.27     2.98   …   7.34   10.37     2.33

(A labeled point on the sketch: (29, 5.5).)

2. A Probabilistic Model
In simple linear regression the data are described by

    Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,   i = 1, …, n,   where \epsilon_i ~ i.i.d. N(0, \sigma^2).

The fitted model is

    \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,

where \hat{\beta}_0 is the intercept and \hat{\beta}_1 is the slope of the regression line.

3. Fitting the Simple Linear Regression Model
Table 3.1 Tire tread wear vs. mileage

  Mileage (in 1000 miles)   Groove depth (in mils)
           0                       394.33
           4                       329.50
           8                       291.00
          12                       255.17
          16                       229.33
          20                       204.83
          24                       179.00
          28                       163.83
          32                       150.33

Fig 3.1 Scatter plot of tire tread wear vs. mileage.
From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

3. Fitting the Simple Linear Regression Model
The difference between the fitted line and the observed data is the residual e_i, the vertical distance between the fitted line and the data point:

    e_i = y_i - (\beta_0 + \beta_1 x_i),   i = 1, 2, …, n.

Our goal: minimize the sum of squares

    Q = \sum_{i=1}^{n} [ y_i - (\beta_0 + \beta_1 x_i) ]^2.

Fig 3.2 Residuals as vertical distances from the fitted line.

3. Fitting the Simple Linear Regression Model
Least squares method: setting the partial derivatives of Q to zero,

    \partial Q / \partial \beta_0 = -2 \sum_{i=1}^{n} [ y_i - (\beta_0 + \beta_1 x_i) ] = 0
    \partial Q / \partial \beta_1 = -2 \sum_{i=1}^{n} x_i [ y_i - (\beta_0 + \beta_1 x_i) ] = 0

gives the normal equations

    \beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i
    \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
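As a quick numerical check, the normal equations can be solved directly in matrix form. Here is a minimal MATLAB sketch using the tire data of Table 3.1 (variable names are illustrative); it anticipates the closed-form solution derived next.

% Minimal sketch: solve the least squares normal equations for the tire data.
x = [0 4 8 12 16 20 24 28 32]';
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33]';
n = numel(x);
A = [n      sum(x);              % coefficient matrix of the normal equations
     sum(x) sum(x.^2)];
c = [sum(y); sum(x.*y)];         % right-hand side
beta = A\c                       % beta(1) ~ 360.64 (intercept), beta(2) ~ -7.281 (slope)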
3. Fitting the Simple Linear Regression Model
Solving the normal equations gives

    \hat{\beta}_0 = [ (\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i) ] / [ n \sum x_i^2 - (\sum x_i)^2 ]
    \hat{\beta}_1 = [ n \sum x_i y_i - (\sum x_i)(\sum y_i) ] / [ n \sum x_i^2 - (\sum x_i)^2 ]

3. Fitting the Simple Linear Regression Model
To simplify, we denote

    S_xy = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - (1/n)(\sum x_i)(\sum y_i)
    S_xx = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum x_i^2 - (1/n)(\sum x_i)^2
    S_yy = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum y_i^2 - (1/n)(\sum y_i)^2

so that

    \hat{\beta}_1 = S_xy / S_xx,    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.

3. Fitting the Simple Linear Regression Model
Back to the example:

    \sum x_i = 144,  \sum y_i = 2,197.32,  \sum x_i^2 = 3,264,  \sum y_i^2 = 589,887.08,  \sum x_i y_i = 28,167.72,
    n = 9,  \bar{x} = 16,  \bar{y} = 244.15.

    S_xy = \sum x_i y_i - (1/n)(\sum x_i)(\sum y_i) = 28,167.72 - (144 × 2,197.32)/9 = -6,989.40
    S_xx = \sum x_i^2 - (1/n)(\sum x_i)^2 = 3,264 - 144^2/9 = 960

    \hat{\beta}_1 = -6,989.40 / 960 = -7.281   and   \hat{\beta}_0 = 244.15 + 7.281 × 16 = 360.64

3. Fitting the Simple Linear Regression Model
Therefore, the equation of the fitted line is

    \hat{y} = 360.64 - 7.281 x.

Not enough! We still need to assess how well the line fits.

3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. We define SST = SSR + SSE, where
  SST: total sum of squares
  SSR: regression sum of squares
  SSE: error sum of squares.
Proof:

    SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
        = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
        = SSR + SSE,   since the cross-product term equals 0.

The ratio

    r^2 = SSR / SST = 1 - SSE / SST

is called the coefficient of determination.

3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. Back to the example:

    SST = S_yy = \sum y_i^2 - (1/n)(\sum y_i)^2 = 589,887.08 - (2,197.32)^2/9 = 53,418.73
    SSR = SST - SSE = 53,418.73 - 2,531.53 = 50,887.20
    r^2 = 50,887.20 / 53,418.73 = 0.953
    r = -\sqrt{0.953} = -0.976,   where the sign of r follows from the sign of \hat{\beta}_1.

Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.

3. Fitting the Simple Linear Regression Model
r is the sample correlation coefficient between X and Y:

    r_{X,Y} = S_xy / \sqrt{ S_xx · S_yy }.

For simple linear regression, r_{X,Y}^2 = r^2, the coefficient of determination.

3. Fitting the Simple Linear Regression Model
Estimation of \sigma^2: the variance \sigma^2 measures the scatter of the Y_i around their means \mu_i = \beta_0 + \beta_1 x_i. An unbiased estimate of \sigma^2 is given by

    s^2 = \sum_{i=1}^{n} e_i^2 / (n - 2) = SSE / (n - 2).

3. Fitting the Simple Linear Regression Model
From the example, we have SSE = 2,531.53 and n - 2 = 7, therefore

    s^2 = 2,531.53 / 7 = 361.65,

which has 7 d.f. The estimate of \sigma is s = \sqrt{361.65} = 19.02.

4. Statistical Inference for SLR
Under the normal error assumption:
* Point estimators: \hat{\beta}_0, \hat{\beta}_1
* Sampling distributions of \hat{\beta}_0 and \hat{\beta}_1:

    E(\hat{\beta}_0) = \beta_0,   \hat{\beta}_0 ~ N( \beta_0, \sigma^2 \sum x_i^2 / (n S_xx) ),   SE(\hat{\beta}_0) = s \sqrt{ \sum x_i^2 / (n S_xx) }
    E(\hat{\beta}_1) = \beta_1,   \hat{\beta}_1 ~ N( \beta_1, \sigma^2 / S_xx ),                  SE(\hat{\beta}_1) = s / \sqrt{S_xx}

Derivation

    E(\hat{\beta}_1) = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_xx} E(Y_i)
                     = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_xx} (\beta_0 + \beta_1 x_i)
                     = \beta_0 \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_xx} + \beta_1 \sum_{i=1}^{n} \frac{(x_i - \bar{x}) x_i}{S_xx}
                     = 0 + \beta_1 \frac{1}{S_xx} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \beta_1

    Var(\hat{\beta}_1) = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_xx} \right)^2 Var(Y_i)
                       = \frac{\sigma^2}{S_xx^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sigma^2}{S_xx}

Derivation

    E(\hat{\beta}_0) = E(\bar{Y} - \hat{\beta}_1 \bar{x}) = E(\bar{Y}) - E(\hat{\beta}_1) \bar{x}
                     = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) - \beta_1 \bar{x} = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0

    Var(\hat{\beta}_0) = Var(\bar{Y} - \hat{\beta}_1 \bar{x}) = Var(\bar{Y}) + \bar{x}^2 Var(\hat{\beta}_1)   (since Cov(\bar{Y}, \hat{\beta}_1) = 0)
                       = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_xx} = \sigma^2 \frac{S_xx + n \bar{x}^2}{n S_xx} = \frac{\sigma^2 \sum x_i^2}{n S_xx}

For the full mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

Statistical Inference on β0 and β1
* Pivotal quantities (P.Q.'s):

    \frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} ~ t_{n-2},    \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} ~ t_{n-2}

* Confidence intervals (C.I.'s):

    \hat{\beta}_0 ± t_{n-2, \alpha/2} SE(\hat{\beta}_0),    \hat{\beta}_1 ± t_{n-2, \alpha/2} SE(\hat{\beta}_1)
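To make these quantities concrete, here is a minimal MATLAB sketch that plugs the tire-example values computed above into the standard-error and confidence-interval formulas (tinv requires the Statistics Toolbox; the numbers in the comments are approximate).

% Minimal sketch: standard errors and 95% CIs for the tire example.
n = 9; Sxx = 960; sumx2 = 3264;          % summary quantities from the example
b0 = 360.64; b1 = -7.281; s = 19.02;     % least squares estimates and sqrt(MSE)
se_b1 = s/sqrt(Sxx)                      % about 0.61
se_b0 = s*sqrt(sumx2/(n*Sxx))            % about 11.7
t0 = b1/se_b1                            % slope over its SE, about -11.9 (used in the t test below)
ci_b1 = b1 + [-1 1]*tinv(0.975, n-2)*se_b1   % about (-8.73, -5.83)
ci_b0 = b0 + [-1 1]*tinv(0.975, n-2)*se_b0   % 95% CI for the intercept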
Hypothesis tests:

    H_0: \beta_1 = \beta_1^0   vs.   H_a: \beta_1 \ne \beta_1^0

Reject H_0 at level \alpha if

    |t_0| = \frac{ | \hat{\beta}_1 - \beta_1^0 | }{ SE(\hat{\beta}_1) } > t_{n-2, \alpha/2}.

A useful application is to test whether there is a linear relationship between x and y:

    H_0: \beta_1 = 0   vs.   H_a: \beta_1 \ne 0.

Reject H_0 at level \alpha if

    |t_0| = \frac{ | \hat{\beta}_1 | }{ SE(\hat{\beta}_1) } > t_{n-2, \alpha/2}.

Analysis of Variance (ANOVA)
Mean square: a sum of squares divided by its degrees of freedom:

    MSR = SSR / 1   and   MSE = SSE / (n - 2).

The F statistic for H_0: \beta_1 = 0 is

    F_0 = \frac{MSR}{MSE} = \frac{SSR}{s^2} = \frac{\hat{\beta}_1^2 S_xx}{s^2} = \left( \frac{\hat{\beta}_1}{s / \sqrt{S_xx}} \right)^2 = t_0^2,
    and   f_{1, n-2, \alpha} = t_{n-2, \alpha/2}^2,

so the F test is equivalent to the two-sided t test of H_0: \beta_1 = 0.

Analysis of Variance (ANOVA)
ANOVA Table

  Source of Variation (Source)   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)    F
  Regression                     SSR                   1                           MSR = SSR/1         F = MSR/MSE
  Error                          SSE                   n-2                         MSE = SSE/(n-2)
  Total                          SST                   n-1

5. Regression Diagnostics
5.1 Checking the Model Assumptions
  5.1.1 Checking for Linearity
  5.1.2 Checking for Constant Variance
  5.1.3 Checking for Normality
  Primary tool: residual plots
5.2 Checking for Outliers and Influential Observations
  5.2.1 Checking for Outliers
  5.2.2 Checking for Influential Observations
  5.2.3 How to Deal with Outliers and Influential Observations

5. Regression Diagnostics
5.1.1 Checking for Linearity

  i    x_i    y_i       \hat{y}_i    e_i
  1     0     394.33    360.64       33.69
  2     4     329.50    331.51       -2.01
  3     8     291.00    302.39      -11.39
  4    12     255.17    273.27      -18.10
  5    16     229.33    244.15      -14.82
  6    20     204.83    215.02      -10.19
  7    24     179.00    185.90       -6.90
  8    28     163.83    156.78        7.05
  9    32     150.33    127.66       22.67

Table 5.1 The x_i, y_i, \hat{y}_i, e_i for the Tire Wear Data
Figure 5.1 Scatter plots of y_i vs. x_i and e_i vs. x_i for the Tire Wear Data

5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)
Figure 5.2 Typical scatter plot shapes and corresponding linearizing transformations (replacing x by x^2, x^3, \sqrt{x}, \log x, or -1/x, and/or y by y^2, y^3, \sqrt{y}, \log y, or -1/y, depending on the curvature).

5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)
Fitting the straight line to ln(y) instead of y (an exponential model for the tread wear):

  i    x_i    y_i       fitted ln(y_i)    \hat{y}_i    e_i
  1     0     394.33    5.926             374.64       19.69
  2     4     329.50    5.807             332.58       -3.08
  3     8     291.00    5.688             295.24       -4.24
  4    12     255.17    5.569             262.09       -6.92
  5    16     229.33    5.450             232.67       -3.34
  6    20     204.83    5.331             206.54       -1.71
  7    24     179.00    5.211             183.36       -4.36
  8    28     163.83    5.092             162.77        1.06
  9    32     150.33    4.973             144.50        5.83

Table 5.2 The x_i, y_i, fitted ln(y_i), \hat{y}_i, e_i for the Tire Wear Data
Figure 5.2 Scatter plots of ln(y_i) vs. x_i and e_i vs. x_i for the Tire Wear Data

5. Regression Diagnostics
5.1.2 Checking for Constant Variance
Plot the residuals against the fitted values. If the constant variance assumption is correct, the dispersion of the e_i's is approximately constant with respect to the \hat{y}_i's.
Figure 5.3 Plot of residuals e_i vs. \hat{y}_i for the Tire Wear Data
Figure 5.4 Plots of residuals e_i vs. \hat{y}_i corresponding to different functional relationships between Var(Y) and E(Y)
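The residual-versus-fitted plot above takes only a few lines of MATLAB. This is a minimal sketch for the transformed (log) tire model, with residuals computed on the original scale as in Table 5.2.

% Minimal sketch: residuals vs. fitted values for the log-transformed tire model.
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
p = polyfit(x, log(y), 1);      % fit ln(y) = b0 + b1*x
yhat = exp(polyval(p, x));      % fitted values back on the original scale
e = y - yhat;                   % residuals, as in Table 5.2
plot(yhat, e, 'o')              % look for roughly constant vertical spread
xlabel('fitted value'); ylabel('residual')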
5. Regression Diagnostics
5.1.3 Checking for Normality
Make a normal plot of the residuals. The residuals have a zero mean and an approximately constant variance (assuming the other assumptions about the model are correct).
Figure 5.5 Normal probability plot for the Tire Wear Data (p-value = 0.0097)

5. Regression Diagnostics
5.2 Checking for Outliers and Influential Observations
Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier. The standardized residuals are

    e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{ s \sqrt{ 1 - 1/n - (x_i - \bar{x})^2 / S_xx } } \approx \frac{e_i}{s},   i = 1, 2, …, n.

If |e_i^*| > 2, the corresponding observation may be regarded as an outlier.

Influential observation: an influential observation has an extreme x-value, an extreme y-value, or both. If we express the fitted value of y as a linear combination of all the y_j,

    \hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j,   with leverage   h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_xx},

then, if h_ii > 2(k+1)/n (k = number of predictors, here k = 1), the corresponding observation may be regarded as an influential observation.

5. Regression Diagnostics
5.2 Checking for Outliers and Influential Observations

  i    e_i^*      h_ii
  1    2.8653     0.3778
  2   -0.4113     0.2611
  3   -0.5367     0.1778
  4   -0.8505     0.1278
  5   -0.4067     0.1111
  6   -0.2102     0.1278
  7   -0.5519     0.1778
  8    0.1416     0.2611
  9    0.8484     0.3778

Table 5.3 Standardized residuals and leverages for the transformed Tire Wear Data.
Only |e_1^*| = 2.8653 exceeds 2, so the first observation may be an outlier; no leverage exceeds the cutoff 2(k+1)/n = 4/9 ≈ 0.44.

MATLAB Code for Regression Diagnostics
clear; clc;
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
y1 = log(y);                              % data transformation
p = polyfit(x,y,1)                        % linear regression of y on x
% p = polyfit(x,y1,1)                     % fit the transformed (log) model instead
yfit = polyval(p,x)                       % fitted values
yresid = y - yfit                         % residuals
% yresid = y - exp(yfit)                  % residuals on the original scale for the transformed model
ssresid = sum(yresid.^2);                 % residual sum of squares (SSE)
sstotal = (length(y)-1)*var(y);           % total sum of squares (SST)
rsq = 1 - ssresid/sstotal;                % coefficient of determination r^2
normplot(yresid)                          % normal probability plot of the residuals
[h0,pval,jbstat,critval] = jbtest(yresid) % Jarque-Bera test of normality
figure
scatter(x,y,500,'r','.')                  % scatter plot with least squares line
lsline
axis([-5 35 100 450])
xlabel('x_i'); ylabel('y_i'); title('Tread wear vs. mileage')
n = length(x);
s = sqrt(ssresid/(n-2));                  % estimate of sigma
Sxx = sum((x - mean(x)).^2);
hii = 1/n + (x - mean(x)).^2/Sxx          % leverages: check for influential observations
estd = yresid./(s*sqrt(1 - hii))          % standardized residuals: check for outliers

6.1 Correlation Analysis
Why do we need this? Regression analysis is used to model the relationship between two variables when one is treated as the response and the other as the explanatory variable. When there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.
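Before moving on, note that the sample correlation coefficient introduced in Section 3 is one line of MATLAB. A minimal sketch for the tire data (the values in the comments are approximate and match the r and r^2 found earlier):

% Minimal sketch: sample correlation between mileage and groove depth.
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
R = corrcoef(x, y);     % 2x2 correlation matrix
r = R(1,2)              % about -0.976, same sign as the fitted slope
r^2                     % about 0.953, the coefficient of determination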
6.1 Correlation Analysis – Example
Figure 6.1 Example variables: flu cases reported, temperature, people who get a flu shot; life expectancy, economy level, economic growth.

6.2 Bivariate Normal Distribution
Because we need to investigate the correlation between X and Y, we model (X, Y) with the bivariate normal density

    f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho \left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right\},

where

    \rho = Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}},   -1 \le \rho \le 1.

Figure 6.2
Source: http://wiki.stat.ucla.edu/socr/index.php/File:SOCR_BivariateNormal_JS_Activity_Fig7.png

6.2 Why introduce the Bivariate Normal Distribution?
First, we need to do some computation:

    E(Y | X = x) = \mu_Y - \frac{\rho \sigma_Y}{\sigma_X} \mu_X + \frac{\rho \sigma_Y}{\sigma_X} x,
    Var(Y | X = x) = (1 - \rho^2) \sigma_Y^2.

Compare with the regression model:

    \beta_0 = \mu_Y - \frac{\rho \sigma_Y}{\sigma_X} \mu_X,    \beta_1 = \frac{\rho \sigma_Y}{\sigma_X}.

So, if (X, Y) has a bivariate normal distribution, then the simple linear regression model is true: the conditional mean of Y is linear in x and the conditional variance is constant.

6.3 Statistical Inference on ρ
Define the r.v. R corresponding to the sample correlation r. The distribution of R is quite complicated.
Figure 6.3 Densities f(r) of R for ρ = -0.7, -0.3, 0, and 0.5.

6.3 Exact test when ρ = 0
Test H_0: \rho = 0 vs. H_a: \rho \ne 0 with the test statistic

    T_0 = \frac{ R \sqrt{n-2} }{ \sqrt{1 - R^2} }.

Reject H_0 iff |t_0| > t_{n-2, \alpha/2}.

Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students, and the correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?
H_0: \rho = 0 vs. H_a: \rho \ne 0:

    t_0 = \frac{0.7 \sqrt{15 - 2}}{\sqrt{1 - 0.7^2}} = 3.534 > t_{13, .005} = 3.012,

so we reject H_0.

6.3 Note: they are the same!
Because

    r = \hat{\beta}_1 \frac{s_x}{s_y} = \hat{\beta}_1 \sqrt{\frac{S_xx}{S_yy}} = \hat{\beta}_1 \sqrt{\frac{S_xx}{SST}}   and   1 - r^2 = \frac{SSE}{SST} = \frac{(n-2) s^2}{SST},

we have

    t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}} = \hat{\beta}_1 \sqrt{\frac{S_xx}{SST}} \sqrt{\frac{(n-2) SST}{(n-2) s^2}} = \frac{\hat{\beta}_1}{s / \sqrt{S_xx}} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}.

So the test of H_0: \beta_1 = 0 is equivalent to the test of H_0: \rho = 0.

6.3 Approximate test when ρ ≠ 0
Because the exact distribution of R is not very useful for making inferences on ρ, R. A. Fisher showed that the transformation

    \hat{Z} = \tanh^{-1} R = \frac{1}{2} \ln \frac{1+R}{1-R}

is approximately distributed as

    N\left( \frac{1}{2} \ln \frac{1+\rho}{1-\rho},\; \frac{1}{n-3} \right).

6.3 Steps to do the approximate test on ρ
1. H_0: \rho = \rho_0 vs. H_1: \rho \ne \rho_0.
2. Point estimator: \hat{z} = \tanh^{-1} r.
3. Test statistic: z_0 = ( \hat{z} - \tanh^{-1} \rho_0 ) \sqrt{n-3}, referred to the standard normal distribution.
4. C.I.: back-transform \hat{z} ± z_{\alpha/2} / \sqrt{n-3} by tanh to obtain a confidence interval for ρ.
(A small MATLAB sketch applying this to the vote example appears after the SAS regression output in Section 7.)

6.4 The pitfalls of correlation analysis
• Lurking variables
• Over-extrapolation

7. Implementation in SAS

  obs   state   district   democA   voteA   expendA   expendB   prtystrA   lexpendA    lexpendB    shareA
   1    "AL"       7          1       68     328.3       8.74       41      5.793916    2.167567    97.41
   2    "AK"       1          0       62     626.38    402.48       60      6.439952    5.997638    60.88
   3    "AZ"       2          1       73      99.61      3.07       55      4.601233    1.120048    97.01
   …
  173   "WI"       8          1       30      14.42    227.82       47      2.668685    5.428569     5.95

Table 7.1 Vote example data

7. Implementation in SAS
SAS code for the vote example:

proc corr data=vote1;
  var F4 F10;
run;

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

          F4         F10
  F4    1.00000    0.92528

Table 7.2 Correlation coefficients

proc reg data=vote1;
  model F4=F10;
  label F4=voteA;
  label F10=shareA;
  output out=fitvote residual=R;
run;

7. Implementation in SAS
SAS output

Analysis of Variance

  Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model              1         41486            41486      1017.70    <.0001
  Error            171      6970.77364       40.76476
  Corrected Total  172         48457

  Root MSE          6.38473    R-Square    0.8561
  Dependent Mean   50.50289    Adj R-Sq    0.8553
  Coeff Var        12.64230

Parameter Estimates

  Variable    Label       DF    Parameter Estimate    Standard Error    t Value
  Intercept   Intercept    1         26.81254              0.88719        30.22
  F10         F10          1          0.46382              0.01454        31.90

Table 7.3 SAS output for the vote example

7. Implementation in SAS
Figure 7.1 Plot of Residual vs. shareA for the vote example

7. Implementation in SAS
Figure 7.2 Plot of voteA vs. shareA for the vote example
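As a cross-check connecting Section 6.3 to this output, the slope's t value can be recovered from the Pearson correlation alone, and Fisher's z transformation gives an approximate confidence interval for ρ. The following minimal MATLAB sketch uses only the numbers reported above (r = 0.92528, n = 173); the values in the comments are approximate.

% Minimal sketch: inference on rho for the vote example, from the SAS output above.
r = 0.92528; n = 173;
t0 = r*sqrt(n-2)/sqrt(1 - r^2)            % about 31.90, matching the t value for F10
z  = atanh(r);                            % Fisher's z transformation, 0.5*log((1+r)/(1-r))
ci = tanh(z + [-1 1]*1.96/sqrt(n-3))      % approximate 95% CI for rho, about (0.90, 0.94)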
7. Implementation in SAS
SAS – Check Homoscedasticity
Figure 7.3 Residual diagnostic plots from the SAS output for the vote example

7. Implementation in SAS
SAS – Check Normality of Residuals
SAS code:

proc univariate data=fitvote normal;
  var R;
  qqplot R / normal (Mu=est Sigma=est);
run;

Tests for Location: Mu0=0

  Test           Statistic           p Value
  Student's t    t           0       Pr > |t|     1.0000
  Sign           M        -0.5       Pr >= |M|    1.0000
  Signed Rank    S      -170.5       Pr >= |S|    0.7969

Tests for Normality

  Test                  Statistic            p Value
  Shapiro-Wilk          W       0.952811     Pr < W       0.7395
  Kolmogorov-Smirnov    D       0.209773     Pr > D      >0.1500
  Cramer-von Mises      W-Sq    0.056218     Pr > W-Sq   >0.2500
  Anderson-Darling      A-Sq    0.30325      Pr > A-Sq   >0.2500

Table 7.4 SAS output for checking normality

7. Implementation in SAS
SAS – Check Normality of Residuals
Figure 7.4 Plot of Residual vs. Normal Quantiles for the vote example

8. Application
• Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in these disciplines:
  - Marketing/business analytics
  - Healthcare
  - Finance
  - Economics
  - Ecology/environmental science

8. Application
• Prediction and forecasting
Linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of Y, the fitted model can be used to make a prediction of the value of Y.

8. Application
• Quantifying the strength of a relationship
Given a variable Y and a number of variables X1, ..., Xp that may be related to Y, linear regression analysis can be applied to assess which Xj may have no relationship with Y at all, and to identify which subsets of the Xj contain redundant information about Y.

8. Application
Example 1. Trend line
A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.
Figure 8.1 Refrigerator sales over a 13-year period
http://www.likeoffice.com/28057/Excel-2007-Formatting-charts

8. Application
Example 2. Clinical drug trials
Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance employing linear regression analysis.
Figure 8.2 BSA Protein Concentration vs. Absorbance
http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13

Summary
• Linear Regression Analysis: probabilistic model, least squares estimation, statistical inference; model assumptions (linearity, constant variance, normality), outliers and influential observations, data transformation.
• Correlation Analysis: correlation coefficient, bivariate normal distribution, exact t-test, approximate z-test.

Acknowledgement & References
Acknowledgement
• Sincere thanks go to Prof. Wei Zhu.
References
• Statistics and Data Analysis, Ajit Tamhane & Dorothy Dunlop. Prentice Hall.
• Introductory Econometrics: A Modern Approach, Jeffrey M. Wooldridge, 5th ed.
• http://en.wikipedia.org/wiki/Regression_analysis
• http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
• etc. (web links have already been included in the slides)