AMS 572 Group #2

Outline
• Jinmiao Fu—Introduction and History
• Ning Ma—Establishing and Fitting the Model
• Ruoyu Zhou—Multiple Regression Model in Matrix Notation
• Dawei Xu and Yuan Shang—Statistical Inference for Multiple Regression
• Yu Mu—Regression Diagnostics
• Chen Wang and Tianyu Lu—Topics in Regression Modeling
• Tian Feng—Variable Selection Methods
• Hua Mo—Chapter Summary and Modern Applications

Introduction
• Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of each independent variable x is associated with a value of the dependent variable y.
• Example: the relationship between an adult's health and his/her daily intake of wheat, vegetables, and meat.

History

Karl Pearson (1857–1936)
Lawyer, Germanist, eugenicist, mathematician and statistician. Contributions include the correlation coefficient, the method of moments, Pearson's system of continuous curves, chi distance, the p-value, statistical hypothesis testing and statistical decision theory, Pearson's chi-square test, and principal component analysis.

Sir Francis Galton FRS (16 February 1822 – 17 January 1911)
Anthropologist and polymath; doctoral student: Karl Pearson. In the late 1860s, Galton conceived the standard deviation. He created the statistical concept of correlation and also discovered the properties of the bivariate normal distribution and its relationship to regression analysis.
Galton invented the use of the regression line (Bulmer 2003, p. 184) and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.
The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life. He came to be gripped by the work, especially the first chapter, "Variation under Domestication", concerning the breeding of domestic animals.

Adrien-Marie Legendre (18 September 1752 – 10 January 1833) was a French mathematician. He made important contributions to statistics, number theory, abstract algebra and mathematical analysis. He developed the least squares method, which has broad application in linear regression, signal processing, statistics, and curve fitting.

Johann Carl Friedrich Gauss (30 April 1777 – 23 February 1855) was a German mathematician and scientist who contributed significantly to many fields, including number theory, statistics, analysis, differential geometry, geodesy, geophysics, electrostatics, astronomy and optics.
Gauss, who was 23 at the time, heard about the problem of predicting the orbit of Ceres and tackled it. After three months of intense work, he predicted a position for Ceres in December 1801—just about a year after its first sighting—and this turned out to be accurate within a half-degree. In the process, he so streamlined the cumbersome mathematics of 18th-century orbital prediction that his work—published a few years later as Theory of Celestial Movement—remains a cornerstone of astronomical computation.
It introduced the Gaussian gravitational constant and contained an influential treatment of the method of least squares, a procedure used in all sciences to this day to minimize the impact of measurement error. Gauss was able to prove the method in 1809 under the assumption of normally distributed errors (see the Gauss–Markov theorem).
The method had been described earlier by Adrien-Marie Legendre in 1805, but Gauss claimed that he had been using it since 1795.

Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science," and Richard Dawkins described him as "the greatest of Darwin's successors".
In addition to the analysis of variance, Fisher invented the technique of maximum likelihood and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant and Fisher information.

Probabilistic Model
• $y_i$ is the observed value of the random variable (r.v.) $Y_i$, which depends on the fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$ ($i = 1, 2, \ldots, n$):
  $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$
• $\beta_0, \beta_1, \ldots, \beta_k$ are unknown model parameters, $n$ is the number of observations, and the random errors are $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.

Fitting the Model
• The least squares (LS) method provides estimates of the unknown model parameters $\beta_0, \beta_1, \ldots, \beta_k$ by minimizing
  $Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) \right]^2 .$

Tire tread wear vs. mileage (Example 11.1 in the textbook)
• The table gives the measurements on the groove depth of one tire after every 4000 miles.
• Our goal: build a model for the relation between the mileage and the groove depth of the tire.

  Mileage (in 1000 miles) | Groove Depth (in mils)
  0  | 394.33
  4  | 329.50
  8  | 291.00
  12 | 255.17
  16 | 229.33
  20 | 204.83
  24 | 179.00
  28 | 163.83
  32 | 150.33

SAS code: fitting the model

  data example;
    input mile depth @@;
    sqmile = mile*mile;
    datalines;
  0 394.33 4 329.5 8 291 12 255.17 16 229.33
  20 204.83 24 179 28 163.83 32 150.33
  ;
  run;
  proc reg data=example;
    model depth = mile sqmile;
  run;

The fitted quadratic model is Depth = 386.26 − 12.77·mile + 0.172·sqmile.

Goodness of Fit of the Model
• Residuals: $e_i = y_i - \hat{y}_i$ ($i = 1, 2, \ldots, n$), where the fitted values are $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$.
• An overall measure of the goodness of fit is the error sum of squares $SSE = \sum_{i=1}^{n} e_i^2 = \min Q$, together with the total sum of squares $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and the regression sum of squares $SSR = SST - SSE$.

1. Transform the Formulas to Matrix Notation
• $X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$ is the matrix of predictor variables.
• The first column of $X$ (all 1's) corresponds to the constant term $\beta_0$ (we can treat this as $\beta_0 x_{i0}$ with $x_{i0} = 1$).
• Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ be the $(k+1)\times 1$ vector of unknown parameters and $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)'$ the vector of their LS estimates.
• The formula $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$ becomes $Y = X\beta + \epsilon$.
• Simultaneously, the normal equations (the first of which is $n\beta_0 + \beta_1 \sum_{i=1}^{n} x_{i1} + \cdots + \beta_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i$) become $X'X\beta = X'y$. Solving this equation for $\beta$ gives $\hat\beta = (X'X)^{-1} X'y$ (if the inverse of the matrix $X'X$ exists).

2. Example 11.2 (Tire Wear Data: Quadratic Fit Using Hand Calculations)
• We will do Example 11.1 again in this part using the matrix approach (a SAS PROC IML sketch of the same computation is given below).
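Not from the original slides: a minimal SAS PROC IML sketch of the matrix computation $\hat\beta = (X'X)^{-1}X'y$ for the tire data, assuming the quadratic design matrix (columns 1, mile, mile²) used in the hand calculation that follows.

  proc iml;
    mile = {0, 4, 8, 12, 16, 20, 24, 28, 32};
    y    = {394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33};
    X = j(nrow(mile), 1, 1) || mile || mile##2;  /* design matrix: intercept, mile, mile^2 */
    xtx_inv  = inv(X` * X);                      /* (X'X)^{-1} */
    beta_hat = xtx_inv * X` * y;                 /* LS estimates */
    print xtx_inv beta_hat;                      /* compare with the hand calculation below */
  quit;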
• For the quadratic model to be fitted,
  $X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 4 & 16 \\ 1 & 8 & 64 \\ 1 & 12 & 144 \\ 1 & 16 & 256 \\ 1 & 20 & 400 \\ 1 & 24 & 576 \\ 1 & 28 & 784 \\ 1 & 32 & 1024 \end{bmatrix}$ and $y = \begin{bmatrix} 394.33 \\ 329.50 \\ 291.00 \\ 255.17 \\ 229.33 \\ 204.83 \\ 179.00 \\ 163.83 \\ 150.33 \end{bmatrix}$.
• According to the formula $\hat\beta = (X'X)^{-1}X'y$, we need to calculate $X'X$ first and then invert it to get $(X'X)^{-1}$:
  $X'X = \begin{bmatrix} 9 & 144 & 3264 \\ 144 & 3264 & 82{,}944 \\ 3264 & 82{,}944 & 2{,}245{,}632 \end{bmatrix}$, $(X'X)^{-1} = \begin{bmatrix} 0.6606 & -0.0773 & 0.0019 \\ -0.0773 & 0.0140 & -0.0004 \\ 0.0019 & -0.0004 & 0.0000 \end{bmatrix}$.
• Finally, we calculate the vector of LS estimates:
  $\hat\beta = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = (X'X)^{-1}X'y = \begin{bmatrix} 386.265 \\ -12.772 \\ 0.172 \end{bmatrix}$.
• Therefore, the LS quadratic model is $\hat{y} = 386.265 - 12.772x + 0.172x^2$. This is the same model as we obtained in Example 11.1.

Statistical Inference for Multiple Regression
• Goal: determine which predictor variables have statistically significant effects.
• We test the hypotheses $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$.
• If we cannot reject $H_{0j}$, then $x_j$ is not a significant predictor of $y$.

Statistical Inference on the $\beta$'s
• Review of statistical inference for simple linear regression:
  $\hat\beta_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)$, so $\frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0,1)$, and $W = \frac{(n-2)S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$.
  Combining them, $t = \frac{N(0,1)}{\sqrt{W/(n-2)}} = \frac{\hat\beta_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}$.
• For multiple regression the steps are similar:
  $\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj})$, $\frac{[n-(k+1)]S^2}{\sigma^2} \sim \chi^2_{n-(k+1)}$, and $t = \frac{(\hat\beta_j - \beta_j)/(\sigma\sqrt{V_{jj}})}{\sqrt{[n-(k+1)]S^2 / \{[n-(k+1)]\sigma^2\}}} \sim t_{n-(k+1)}$.
• What is $V_{jj}$? Why is $\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj})$?
  1. Mean. Recall from simple linear regression that the least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased: $E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$. Here the vector of least squares estimators is also unbiased: $E(\hat\beta_j) = \beta_j$ for $j = 0, 1, \ldots, k$.
  2. Variance. Under the constant-variance assumption $\mathrm{Var}(\epsilon_i) = \sigma^2$, we have $\mathrm{Var}(Y) = \sigma^2 I_n$.
  Since $\hat\beta = (X'X)^{-1}X'Y = cY$ with $c = (X'X)^{-1}X'$,
  $\mathrm{Var}(\hat\beta) = c\,\mathrm{Var}(Y)\,c' = (X'X)^{-1}X'(\sigma^2 I_n)\left[(X'X)^{-1}X'\right]' = \sigma^2 (X'X)^{-1}$.
  Let $V_{jj}$ be the $j$th diagonal entry of $(X'X)^{-1}$; then $\mathrm{Var}(\hat\beta_j) = \sigma^2 V_{jj}$.
• Summing up, $E(\hat\beta_j) = \beta_j$ and $\mathrm{Var}(\hat\beta_j) = \sigma^2 V_{jj}$, so $\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj})$ and $\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1)$.
• As in simple linear regression, the unbiased estimator of the unknown error variance $\sigma^2$ is
  $S^2 = \frac{SSE}{n-(k+1)} = \frac{\sum e_i^2}{n-(k+1)} = MSE$ with $n-(k+1)$ d.f.
  Moreover, $W = \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$, and $S^2$ and $\hat\beta_j$ are statistically independent.
• Therefore
  $\frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} = \frac{\hat\beta_j - \beta_j}{S\sqrt{V_{jj}}} \sim t_{n-(k+1)}$, where $SE(\hat\beta_j) = s\sqrt{v_{jj}}$.
• Derivation of the confidence interval for $\beta_j$:
  $P\!\left(-t_{n-(k+1),\alpha/2} \le \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \le t_{n-(k+1),\alpha/2}\right) = 1-\alpha$, i.e.
  $P\!\left(\hat\beta_j - t_{n-(k+1),\alpha/2}SE(\hat\beta_j) \le \beta_j \le \hat\beta_j + t_{n-(k+1),\alpha/2}SE(\hat\beta_j)\right) = 1-\alpha$.
  The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is $\hat\beta_j \pm t_{n-(k+1),\alpha/2}\,SE(\hat\beta_j)$.
• An $\alpha$-level test of the hypotheses $H_{0j}: \beta_j = \beta_j^0$ vs. $H_{1j}: \beta_j \neq \beta_j^0$ uses
  $P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true}) = P(|t_j| > c) = \alpha$ with $c = t_{n-(k+1),\alpha/2}$,
  and rejects $H_{0j}$ if $|t_j| = \left|\frac{\hat\beta_j - \beta_j^0}{SE(\hat\beta_j)}\right| > t_{n-(k+1),\alpha/2}$.

Prediction of Future Observations
• Having fitted a multiple regression model, suppose we wish to predict the future value of $Y$ for a specified vector of predictor variables $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$.
• One way is to estimate $E(Y^*)$ by a confidence interval (CI); a SAS sketch of these interval estimates is given below.
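Not from the original slides: a minimal SAS sketch (reusing the tire data set example defined earlier) of how these interval estimates can be obtained in practice. The CLB option prints confidence limits for the β's, while CLM and CLI print, for each observation, a confidence interval for E(Y*) and a prediction interval for a future Y*. The data set name example_pred and the added mileage value 36 are hypothetical, introduced only for illustration.

  /* Append a hypothetical prediction point (36 thousand miles) with missing depth; */
  /* PROC REG excludes it from the fit but still reports its interval estimates.    */
  data example_pred;
    set example end=last;
    output;
    if last then do;
      mile = 36; sqmile = 36*36; depth = .;
      output;
    end;
  run;

  proc reg data=example_pred;
    model depth = mile sqmile / clb clm cli alpha=0.05;  /* CIs for the betas, E(Y*), and Y* */
  run;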
Prediction of Future Observations
• The point estimate of $\mu^* = E(Y^*)$ is
  $\hat\mu^* = \hat\beta_0 + \hat\beta_1 x_1^* + \cdots + \hat\beta_k x_k^* = (x^*)'\hat\beta$,
  with $\mathrm{Var}[(x^*)'\hat\beta] = (x^*)'\,\mathrm{Var}(\hat\beta)\,x^* = \sigma^2 (x^*)'(X'X)^{-1}x^* = \sigma^2 (x^*)'Vx^*$.
• Replacing $\sigma^2$ by its estimate $s^2 = MSE$, which has $n-(k+1)$ d.f., and using the same methods as in simple linear regression, a $(1-\alpha)$-level CI for $\mu^*$ is given by
  $\hat\mu^* - t_{n-(k+1),\alpha/2}\, s\sqrt{(x^*)'Vx^*} \;\le\; \mu^* \;\le\; \hat\mu^* + t_{n-(k+1),\alpha/2}\, s\sqrt{(x^*)'Vx^*}$.

F-Test for the $\beta_j$'s
• Consider $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_1:$ at least one $\beta_j \neq 0$.
• Here $H_0$ is the overall null hypothesis, which states that none of the $x$ variables are related to $y$; the alternative states that at least one is related.

How to Build an F-Test
• The test statistic $F = MSR/MSE$ follows an F-distribution with $k$ and $n-(k+1)$ d.f. The $\alpha$-level test rejects $H_0$ if
  $F = \frac{MSR}{MSE} > f_{k,\,n-(k+1),\,\alpha}$,
  where $MSE = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)}$ (the error mean square) has $n-(k+1)$ degrees of freedom.

The relation between F and $r^2$
• $F$ can be written as a function of $r^2$. Using $SSR = r^2\,SST$ and $SSE = (1-r^2)\,SST$,
  $F = \frac{r^2\,[n-(k+1)]}{k\,(1-r^2)}$.
• We see that $F$ is an increasing function of $r^2$ and tests its significance.

Analysis of Variance (ANOVA)
• The relation between SST, SSR and SSE is $SST = SSR + SSE$, where
  $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$, $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$, $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
• The corresponding degrees of freedom (d.f.) are
  $d.f.(SST) = n-1$, $d.f.(SSR) = k$, $d.f.(SSE) = n-(k+1)$.

ANOVA Table for Multiple Regression

  Source of Variation | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS) | F
  Regression | SSR | k | MSR = SSR/k | F = MSR/MSE
  Error | SSE | n-(k+1) | MSE = SSE/[n-(k+1)] |
  Total | SST | n-1 | |

  This table gives a clear view of the analysis of variance for multiple regression.

Extra Sum of Squares Method for Testing Subsets of Parameters
• Before, we considered the full model with k predictors. Now consider the partial model
  $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$ ($i = 1, 2, \ldots, n$),
  in which the last m coefficients are set to zero. We can test these m coefficients for significance:
  $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_1:$ at least one of $\beta_{k-m+1}, \ldots, \beta_k \neq 0$.

Building the F-test by Using the Extra Sum of Squares Method
• Let $SSR_{k-m}$ and $SSE_{k-m}$ be the regression and error sums of squares for the partial model. Since SST is fixed regardless of the particular model,
  $SST = SSR_{k-m} + SSE_{k-m} = SSR_k + SSE_k$,
  and therefore $SSE_{k-m} - SSE_k = SSR_k - SSR_{k-m}$ (the "extra sum of squares").
• The $\alpha$-level F-test rejects the null hypothesis if
  $F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} > f_{m,\,n-(k+1),\,\alpha}$.

Remarks on the F-test
• The numerator d.f. is m, the number of coefficients set to zero, while the denominator d.f. is n-(k+1), the error d.f. for the full model.
• The MSE in the denominator is the normalizing factor; it is an estimate of $\sigma^2$ for the full model. If the ratio is large, we reject $H_0$.

Links between ANOVA and the Extra Sum of Squares Method
• Let m = k, so the partial model contains only the intercept. Then
  $SSE_0 = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SST$ and $SSE_k = SSE$,
  from which $SSE_0 - SSE_k = SST - SSE = SSR$. Hence the F-ratio equals
  $F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{MSR}{MSE}$ with k and n-(k+1) d.f.

5 Regression Diagnostics
5.1 Checking the Model Assumptions
• Plots of the residuals against individual predictor variables: check for linearity.
• A plot of the residuals against the fitted values: check for constant variance.
• A normal plot of the residuals: check for normality.
• A run chart of the residuals: check whether the random errors are autocorrelated.
• Plots of the residuals against any omitted predictor variables: check whether any of the omitted predictor variables should be included in the model. (A SAS sketch for these plots follows.)
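Not from the original slides: a minimal SAS sketch of the residual plots listed above, for the tire data fitted earlier (data set example). The data set name diag and the variable names fitted and resid are introduced here only for illustration.

  proc reg data=example;
    model depth = mile sqmile;
    output out=diag p=fitted r=resid;   /* save fitted values and residuals */
  run;

  proc sgplot data=diag;                /* residuals vs. a predictor: check linearity */
    scatter x=mile y=resid;
    refline 0 / axis=y;
  run;

  proc sgplot data=diag;                /* residuals vs. fitted values: check constant variance */
    scatter x=fitted y=resid;
    refline 0 / axis=y;
  run;

  proc univariate data=diag normal;     /* normal Q-Q plot of the residuals: check normality */
    var resid;
    qqplot resid / normal(mu=est sigma=est);
  run;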
(Example slides: plots of the residuals against individual predictor variables, plot of the residuals against fitted values, and normal plot of the residuals, each with its SAS code; figures not reproduced here.)

5.2 Checking for Outliers and Influential Observations
• Standardized residuals:
  $e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1-h_{ii}}}$.
  Large $|e_i^*|$ values indicate outlier observations.
• Hat matrix: $H = X(X'X)^{-1}X'$. If the hat matrix diagonal $h_{ii} > \frac{2(k+1)}{n}$, then the $i$th observation is influential (high leverage).

(Example slides: graphical exploration of outliers and leverage plot; figures not reproduced here.)

5.3 Data Transformation
• Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant error variance. Many seemingly nonlinear models can be written in the multiple linear regression form after making a suitable transformation. For example,
  $y = \beta_0 x_1^{\beta_1} x_2^{\beta_2}$
  becomes, after taking logarithms,
  $\log y = \log\beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2$, i.e. $y^* = \beta_0^* + \beta_1^* x_1^* + \beta_2^* x_2^*$.

Multicollinearity
• Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.
• Examples of multicollinear predictors: height and weight of a person, years of education and income, and assessed value and square footage of a home.
• Consequences of high multicollinearity:
  a. increased standard errors of the estimates of the β's;
  b. often confusing and misleading results.

Detecting Multicollinearity
• Easy way: compute the correlations between all pairs of predictors. If some correlations are close to 1 or −1, remove one of the two correlated predictors from the model. (For instance, in a correlation matrix of X1, X2, X3, a correlation of 1 between X1 and X2 means they are collinear, while a correlation of 0 between X2 and X3 means they are independent.)
• Another way: calculate the variance inflation factor for each predictor $x_j$,
  $VIF_j = \frac{1}{1 - r_j^2}$,
  where $r_j^2$ is the coefficient of determination of the regression of $x_j$ on all the other predictors. If $VIF_j \ge 10$, there is a multicollinearity problem.

Multicollinearity: Example
• See Example 11.5 on page 416. The response is the heat of cement on a per-gram basis (y) and the predictors are tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3) and dicalcium silicate (x4).
• Estimated parameters of the first-order model:
  $\hat{y} = 62.4 + 1.55x_1 + 0.510x_2 + 0.102x_3 - 0.144x_4$.
• F = 111.48 with p-value below 0.0001. The individual t-statistics (p-values) are 2.08 (0.071), 0.70 (0.501), 0.14 (0.896) and −0.20 (0.844).
• Note that the sign on β4 is opposite of what is expected, and such a high F would suggest more than just one significant predictor.
• The correlations were r13 = −0.824 and r24 = −0.973, and the VIFs were all greater than 10. So there is a multicollinearity problem in this model, and we need a variable selection algorithm to help choose the necessary variables. (A SAS sketch of these checks is given below.)
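Not from the original slides: a minimal SAS sketch of these multicollinearity checks. It assumes the cement data set example1 (predictors x1–x4 and response y) that is created in the stepwise-regression example later in this deck.

  proc corr data=example1;          /* pairwise correlations among the predictors */
    var x1 x2 x3 x4;
  run;

  proc reg data=example1;
    model y = x1 x2 x3 x4 / vif;    /* VIF option prints variance inflation factors */
  run;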
Multicollinearity: Subset Selection
• Algorithms for selecting subsets:
  – All possible subsets
    • Only feasible with a small number of potential predictors (maybe 10 or fewer).
    • One or more of the possible numerical criteria can then be used to find the overall best subset.
  – Leaps-and-bounds method
    • Identifies the best subsets for each value of p.
    • Requires fewer variables than observations.
    • Can be quite effective for medium-sized data sets.
    • Has the advantage of producing several slightly different models to compare.
  – Forward stepwise regression
    • Start with no predictors: first include the predictor with the highest correlation with the response; in subsequent steps add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation; stop when the numerical criterion signals its maximum (or minimum); sometimes eliminate variables when their t-value gets too small.
    • The only practical method for very large predictor pools.
    • Local optimization at each step; no guarantee of finding the overall optimum.
  – Backward elimination
    • Start with all predictors in the equation; remove the predictor with the smallest t-value; continue until the numerical criterion signals its maximum (or minimum).
    • Often produces a different final model than the forward stepwise method.

Multicollinearity: Best Subsets Criteria
• Numerical criteria for choosing the best subsets:
  – There is no single generally accepted criterion, and none should be followed too mindlessly. The most common criteria combine a measure of fit with a penalty for increasing complexity (number of predictors).
  – Coefficient of determination (ordinary multiple R-square): always increases with an increasing number of predictors, so it is not very good for comparing models with different numbers of predictors.
  – Adjusted R-square: will decrease if the increase in R-square with increasing p is small.
  – Residual mean square (MSEp): equivalent to adjusted R-square, except that one looks for the minimum; the minimum occurs when an added variable does not decrease the error sum of squares enough to offset the loss of an error degree of freedom.
  – Mallows' Cp statistic: should be about equal to p; look for small values near p. Requires an estimate of the overall error variance.
  – PRESS statistic: the subset associated with the minimum value of PRESSp is chosen; intuitively easier to grasp than the Cp criterion.

Multicollinearity: Forward Stepwise (cement data)
• First include the predictor with the highest correlation with the response (enter if its F-statistic exceeds FIN = 4).
• In subsequent steps add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation (enter Xi if Fi > FIN = 4; remove Xi if Fi < FOUT = 4).
• Summarizing the stepwise algorithm: our "best model" should include only x1 and x2, namely
  y = 52.5773 + 1.4683 x1 + 0.6623 x2.
• Checking the significance of the model and of the individual parameters again, we find that the p-values are all small and each VIF is far less than 10.

Multicollinearity: Best Subsets (cement data)
• Alternatively, we can stop when a numerical criterion signals its maximum (or minimum), and sometimes eliminate variables when their t-value gets too small.
• The largest R-square value, 0.9824, is associated with the full model.
• The best subset that minimizes the Cp criterion includes x1 and x2.
• The subset that maximizes adjusted R-square, or equivalently minimizes MSEp, is x1, x2, x4.
However, the adjusted R-square increases only from 0.9744 to 0.9763 with the addition of x4 to the model already containing x1 and x2.
• Thus the simpler model chosen by the Cp criterion is preferred; its fitted equation is y = 52.5773 + 1.4683 x1 + 0.6623 x2.

Polynomial Models
• Polynomial models are useful in situations where the analyst knows that curvilinear effects are present in the true response function. With more than one explanatory variable we can use a polynomial regression model such as the second-order model
  $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon$.

Multicollinearity: Polynomial Models
• Multicollinearity is a problem in polynomial regression (with terms of second and higher order): x and x² tend to be highly correlated.
• A special solution in polynomial models is to use $z_i = x_i - \bar{x}$ instead of just $x_i$; that is, first subtract its mean from each predictor and then use the deviations in the model.
• Example: for x = 2, 3, 4, 5, 6 we have x² = 4, 9, 16, 25, 36; as x increases, so does x², and $r_{x,x^2} = 0.98$. With $\bar{x} = 4$, z = −2, −1, 0, 1, 2 and z² = 4, 1, 0, 1, 4, so z and z² are no longer correlated: $r_{z,z^2} = 0$.
• We can recover the estimates of the β's from the estimates of the γ's of the centered model, since substituting $z = x - \bar{x}$ into the centered model and expanding returns the original parameterization.

Dummy Predictor Variables
• The dummy variable is a simple and useful method of introducing into a regression analysis information contained in variables that are not conventionally measured on a numerical scale, e.g., race, gender, region, etc.
• The categories of an ordinal variable can be assigned suitable numerical scores.
• A nominal variable with c ≥ 2 categories can be coded using c − 1 indicator variables, X1, ..., X(c−1), called dummy variables: Xi = 1 for the ith category and 0 otherwise, and X1 = ... = X(c−1) = 0 for the cth category.
• If y is a worker's salary and Di = 1 for a non-smoker, Di = 0 for a smoker, we can model this as
  $y_i = \alpha + \beta D_i + u_i$.
• Equally, we can use the dummy variable in a model with other explanatory variables. In addition to the dummy variable we can also add years of experience (x), to give
  $y_i = \alpha + \beta D_i + \gamma x_i + u_i$,
  so that $E(y_i) = (\alpha + \beta) + \gamma x$ for a non-smoker and $E(y_i) = \alpha + \gamma x$ for a smoker.
  (Figure: two parallel lines in the (x, y) plane, with intercept α + β for non-smokers and α for smokers.)
• We can also add the interaction between smoking and experience with respect to their effects on salary:
  $y_i = \alpha + \beta D_i + \gamma x_i + \delta D_i x_i + u_i$,
  so that $E(y_i) = (\alpha + \beta) + (\gamma + \delta) x$ for a non-smoker and $E(y_i) = \alpha + \gamma x$ for a smoker.
  (Figure: two lines with different intercepts and different slopes for non-smokers and smokers.)

Standardized Regression Coefficients
• We typically want to compare predictors in terms of the magnitudes of their effects on the response variable. We use standardized regression coefficients to judge the effects of predictors measured in different units.
• They are the LS parameter estimates obtained by running a regression on the standardized variables
  $y_i^* = \frac{y_i - \bar{y}}{s_y}$, $x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{s_{x_j}}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, k$),
  where $s_y$ and $s_{x_j}$ are the sample SDs of the $y_i$ and the $x_{ij}$.
• Then $\beta_0^* = 0$ and $\beta_j^* = \beta_j \left(\frac{s_{x_j}}{s_y}\right)$ ($j = 1, 2, \ldots, k$).
• The magnitudes of the $\beta_j^*$ can be directly compared to judge the relative effects of the $x_j$ on y.
• Since $\beta_0^* = 0$, the constant term can be dropped from the model.
• Let $y^*$ be the vector of the $y_i^*$'s and $X^*$ be the matrix of the $x_{ij}^*$'s. Then
  $\frac{1}{n-1} X^{*\prime} X^* = R = \begin{bmatrix} 1 & r_{x_1x_2} & \cdots & r_{x_1x_k} \\ r_{x_2x_1} & 1 & \cdots & r_{x_2x_k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{x_kx_1} & r_{x_kx_2} & \cdots & 1 \end{bmatrix}$ and $\frac{1}{n-1} X^{*\prime} y^* = r = \begin{bmatrix} r_{yx_1} \\ r_{yx_2} \\ \vdots \\ r_{yx_k} \end{bmatrix}$.
• So we get
  $\hat\beta^* = \begin{bmatrix} \hat\beta_1^* \\ \vdots \\ \hat\beta_k^* \end{bmatrix} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = R^{-1}r$.
• This method of computing the $\hat\beta_j^*$'s is numerically more stable than computing the $\hat\beta_j$'s directly, because all entries of R and r are between −1 and 1.
• Example (given on page 424): from the calculation we obtain $\hat\beta_1 = 0.19244$ and $\hat\beta_2 = 0.3406$, and the sample standard deviations of x1, x2 and y are $s_{x_1} = 6.830$, $s_{x_2} = 0.641$, $s_y = 1.501$. Then
  $\hat\beta_1^* = \hat\beta_1\left(\frac{s_{x_1}}{s_y}\right) = 0.875$, $\hat\beta_2^* = \hat\beta_2\left(\frac{s_{x_2}}{s_y}\right) = 0.105$.
  Note that $\hat\beta_1^* > \hat\beta_2^*$ although $\hat\beta_1 < \hat\beta_2$; thus x1 has a larger effect than x2 on y.
• We can also use the matrix method to compute the standardized regression coefficients. First we compute the correlation matrix of x1, x2 and y, which gives
  $R = \begin{bmatrix} 1 & 0.913 \\ 0.913 & 1 \end{bmatrix}$ and $r = \begin{bmatrix} 0.971 \\ 0.904 \end{bmatrix}$.
• Next we calculate
  $R^{-1} = \frac{1}{1-r_{x_1x_2}^2}\begin{bmatrix} 1 & -r_{x_1x_2} \\ -r_{x_1x_2} & 1 \end{bmatrix} = \begin{bmatrix} 6.009 & -5.486 \\ -5.486 & 6.009 \end{bmatrix}$.
• Hence
  $\hat\beta^* = \begin{bmatrix} \hat\beta_1^* \\ \hat\beta_2^* \end{bmatrix} = R^{-1}r = \begin{bmatrix} 0.875 \\ 0.105 \end{bmatrix}$,
  which is the same result as before.

How to decide their salaries?

  Player | Age | Position | Years in first team | Scoring | Salary
  Lionel Messi | 23 | Attacker | 5 years | more than 20 goals per year | 10,000,000 EURO/yr
  Carles Puyol | 32 | Defender | 11 years | less than 1 goal per year | 5,000,000 EURO/yr

How to select variables?
• 1) Stepwise regression
• 2) Best subsets regression

Stepwise Regression
• Partial F-test
• Partial correlation coefficients
• How to do it in SAS
• Drawbacks

Partial F-test
• (p−1)-variable model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$
• p-variable model: $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{i,p} + \epsilon_i$

How to do the test?
• $H_{0p}: \beta_p = 0$ vs. $H_{1p}: \beta_p \neq 0$. We reject $H_{0p}$ in favor of $H_{1p}$ at level α if
  $F_p = \frac{(SSE_{p-1} - SSE_p)/1}{SSE_p/[n-(p+1)]} > f_{\alpha,\,1,\,n-(p+1)}$.
• Another way to interpret the test: the test statistic is $t_p = \frac{\hat\beta_p}{SE(\hat\beta_p)}$, with $t_p^2 = F_p$, and we reject $H_{0p}$ at level α if $|t_p| > t_{n-(p+1),\,\alpha/2}$.

Partial Correlation Coefficients
• $r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}} = \frac{SSE(x_1,\ldots,x_{p-1}) - SSE(x_1,\ldots,x_p)}{SSE(x_1,\ldots,x_{p-1})}$.
• Test statistic: $F_p = t_p^2 = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$.
• Add $x_p$ to the regression equation that already includes $x_1, \ldots, x_{p-1}$ only if $F_p$ is large enough.

How to do it in SAS? (Example 11.9, a continuation of Example 11.5)
• The table shows data on the heat evolved in calories during the hardening of cement on a per-gram basis (y), along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).
  No. | X1 | X2 | X3 | X4 | Y
  1  | 7  | 26 | 6  | 60 | 78.5
  2  | 1  | 29 | 15 | 52 | 74.3
  3  | 11 | 56 | 8  | 20 | 104.3
  4  | 11 | 31 | 8  | 47 | 87.6
  5  | 7  | 52 | 6  | 33 | 95.9
  6  | 11 | 55 | 9  | 22 | 109.2
  7  | 3  | 71 | 17 | 6  | 102.7
  8  | 1  | 31 | 22 | 44 | 72.5
  9  | 2  | 54 | 18 | 22 | 93.1
  10 | 21 | 47 | 4  | 26 | 115.9
  11 | 1  | 40 | 23 | 34 | 83.8
  12 | 11 | 66 | 9  | 12 | 113.3
  13 | 10 | 68 | 8  | 12 | 109.4

SAS code

  data example1;
    input x1 x2 x3 x4 y;
    datalines;
  7 26 6 60 78.5
  1 29 15 52 74.3
  11 56 8 20 104.3
  11 31 8 47 87.6
  7 52 6 33 95.9
  11 55 9 22 109.2
  3 71 17 6 102.7
  1 31 22 44 72.5
  2 54 18 22 93.1
  21 47 4 26 115.9
  1 40 23 34 83.8
  11 66 9 12 113.3
  10 68 8 12 109.4
  ;
  run;
  proc reg data=example1;
    model y = x1 x2 x3 x4 / selection=stepwise;
  run;

(SAS output not reproduced here.)

Interpretation
• At the first step, x4 is chosen into the equation because it has the largest correlation with y among the four predictors.
• At the second step, x1 is chosen into the equation because it has the highest partial correlation with y controlling for x4.
• At the third step, since $r_{yx_2|x_4,x_1}$ is greater than $r_{yx_3|x_4,x_1}$, x2 is chosen into the equation rather than x3.
• At the fourth step, x4 is removed from the model because its partial F-statistic is too small.
• From Example 11.5, we know that x4 is highly correlated with x2. Note that in Step 4 the R-square is 0.9787, which is slightly higher than 0.9725, the R-square of Step 2. This indicates that even though x4 is the best single predictor of y, the pair (x1, x2) is a better predictor than the pair (x1, x4).

Drawbacks
• The final model is not guaranteed to be optimal in any specified sense.
• It yields a single final model, while in practice there are often several equally good models.

Best Subsets Regression
• Comparison to the stepwise method
• Optimality criteria
• How to do it in SAS

Comparison to Stepwise Regression
• In best subsets regression, a subset of variables is chosen that optimizes a well-defined objective criterion.
• The best subsets algorithm permits determination of a specified number of best subsets, from which the choice of the final model can be made by the investigator.

Optimality Criteria
• $r_p^2$ criterion:
  $r_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST}$.
• Adjusted $r_p^2$ criterion:
  $r_{adj,p}^2 = 1 - \frac{SSE_p/[n-(p+1)]}{SST/(n-1)} = 1 - \frac{MSE_p}{MST}$.
• $C_p$ criterion. The standardized mean square error of prediction is
  $\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n} E\left[\hat{Y}_{ip} - E(Y_i)\right]^2$.
  Since $\Gamma_p$ involves unknown parameters such as the $\beta_j$'s, we minimize a sample estimate of $\Gamma_p$, Mallows' $C_p$ statistic:
  $C_p = \frac{SSE_p}{\hat\sigma^2} + 2(p+1) - n$.
• In practice, the $C_p$ criterion is preferred because of its ease of computation and its ability to judge the predictive power of a model.

How to do it in SAS? (Example 11.9)

  proc reg data=example1;
    model y = x1 x2 x3 x4 / selection=adjrsq mse cp;
  run;

(SAS output not reproduced here.)

Interpretation
• The best subset that minimizes the Cp criterion is x1, x2, which is the same model selected by stepwise regression in the previous example.
• The subset that maximizes $r_{adj,p}^2$ is x1, x2, x4. However, $r_{adj,p}^2$ increases only from 0.9744 to 0.9763 by the addition of x4 to the model that already contains x1 and x2.
• Thus the model chosen by the Cp criterion is preferred.

Chapter Summary
• Model (extension of simple regression), the multiple regression model:
  $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$,
  where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters.
• Least squares method: minimize
  $Q = \sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right]^2$.
• Fitting the MLR model (normal equations):
  $\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right] = 0$,
  $\frac{\partial Q}{\partial \beta_j} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right]x_{ij} = 0 \quad (j = 1, \ldots, k)$.
• Goodness of fit of the model: $r^2 = \frac{SSR}{SST}$.
• MLR model in matrix notation: $Y = X\beta + \epsilon$, with LS estimates $\hat\beta = (X'X)^{-1}X'Y$.

Statistical Inference for Multiple Regression
• Hypotheses: $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$.
  Test statistic: $T = \frac{\hat\beta_j - \beta_j}{S\sqrt{v_{jj}}} = \frac{Z}{\sqrt{W/[n-(k+1)]}} \sim t_{n-(k+1)}$.
• Hypotheses: $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_a:$ at least one $\beta_j \neq 0$.
  Test statistic: $F = \frac{MSR}{MSE} = \frac{r^2\{n-(k+1)\}}{k(1-r^2)}$.
• Regression diagnostics: residual analysis, data transformation.
• The general hypothesis test: compare the full model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$ with the partial model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$.
  Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_a:$ at least one of these $\beta_j \neq 0$.
  Test statistic: $F_0 = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} \sim f_{m,\,n-(k+1)}$; reject $H_0$ when $F_0 > f_{m,\,n-(k+1),\,\alpha}$.
• Estimating and predicting future observations: let $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$ and $\mu^* = E(Y^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^* = x^{*\prime}\beta$.
  Test statistic: $T = \frac{\hat\mu^* - \mu^*}{s\sqrt{x^{*\prime}Vx^*}} \sim t_{n-(k+1)}$.
  CI for the estimated mean $\mu^*$: $\hat\mu^* \pm t_{n-(k+1),\,\alpha/2}\, s\sqrt{x^{*\prime}Vx^*}$.
  PI for a future observation $Y^*$: $\hat{Y}^* \pm t_{n-(k+1),\,\alpha/2}\, s\sqrt{1 + x^{*\prime}Vx^*}$.
• Topics in regression modeling: multicollinearity, polynomial regression, dummy predictor variables, logistic regression model.
• Variable selection methods: stepwise regression, using the partial F-test $F_p = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$ with partial correlation coefficient $r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}}$; the stepwise regression algorithm; best subsets regression; and a strategy for building an MLR model.

Application of the MLR Model
• Linear regression is widely used in the biological, chemical, financial and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.
• (Diagram: multiple linear regression applied to financial markets, housing prices, biology, heredity, and chemistry.)

Example
• Broadly speaking, an asset pricing model can be expressed as
  $r_i = a_i + b_{1i} f_1 + b_{2i} f_2 + \cdots + b_{ki} f_k + \epsilon_i$,
  where $r_i$, $f_k$ and $k$ denote the expected return on asset i, the kth risk factor and the number of risk factors, respectively, and $\epsilon_i$ denotes the specific return on asset i.
• The equation can also be expressed in matrix notation; the matrix of the b coefficients is called the factor loading matrix.
• Candidate risk factors: inflation rate, GDP, interest rate, rate of return on the market portfolio, employment rate, government policies.

Method
• Step 1: Find the relevant factors (EM algorithm, maximum likelihood).
• Step 2: Fit the model and estimate the factor loadings (multiple linear regression).
• Running the data through multiple linear regression in SAS, we obtain the factor loadings and the coefficient of multiple determination $r^2$.
• From the SAS output we can identify the factors that most affect the return and then build an appropriate multiple-factor model.
• We can use the model to predict future returns and make a good choice!

Thank you