Overview of our study of the multiple linear regression model
Regression models with more than one slope parameter

Example 1: Are brain and body size predictive of intelligence?
• Sample of n = 38 college students
• Response (y): intelligence based on PIQ (performance) scores from the (revised) Wechsler Adult Intelligence Scale
• Potential predictor (x1): brain size based on MRI scans (given as count/10,000)
• Potential predictor (x2): height in inches
• Potential predictor (x3): weight in pounds

Example 1: Scatter matrix plot
[Scatter matrix plot of PIQ, Brain, Height, and Weight]

Scatter matrix plot
• Illustrates the marginal relationship between each pair of variables, without regard to the other variables.
• The challenge is how the response y relates to all three predictors simultaneously.

Example 1: A multiple linear regression model with three quantitative predictors

yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi

where …
• yi is intelligence (PIQ) of student i
• xi1 is brain size (MRI) of student i
• xi2 is height (Height) of student i
• xi3 is weight (Weight) of student i
and … the independent error terms εi follow a normal distribution with mean 0 and equal variance σ².

Example 1: Some research questions
• Which predictors – brain size, height, or weight – explain some of the variation in PIQ?
• What is the effect of brain size on PIQ, after taking into account height and weight?
• What is the PIQ of an individual with a given brain size, height, and weight?

Example 1: Regression output

The regression equation is
PIQ = 111 + 2.06 Brain - 2.73 Height + 0.001 Weight

Predictor      Coef  SE Coef      T      P
Constant     111.35    62.97   1.77  0.086
Brain        2.0604   0.5634   3.66  0.001
Height       -2.732    1.229  -2.22  0.033
Weight       0.0006   0.1971   0.00  0.998

S = 19.79   R-Sq = 29.5%   R-Sq(adj) = 23.3%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3   5572.7  1857.6  4.74  0.007
Residual Error  34  13321.8   391.8
Total           37  18894.6

Source  DF  Seq SS
Brain    1  2697.1
Height   1  2875.6
Weight   1     0.0

Example 2: Baby bird breathing habits in burrows
• Experiment with n = 120 nestling bank swallows
• Response (y): percentage increase in "minute ventilation" (Vent), i.e., the total volume of air breathed per minute
• Potential predictor (x1): percentage of oxygen (O2) in the air the baby birds breathe
• Potential predictor (x2): percentage of carbon dioxide (CO2) in the air the baby birds breathe

Example 2: Scatter matrix plot
[Scatter matrix plot of Vent, O2, and CO2]

Example 2: Three-dimensional scatter plot
[Three-dimensional scatter plot of Vent against O2 and CO2]

Example 2: A first order model with two quantitative predictors

yi = β0 + β1xi1 + β2xi2 + εi

where …
• yi is percentage increase in minute ventilation
• xi1 is percentage of oxygen
• xi2 is percentage of carbon dioxide
and … the independent error terms εi follow a normal distribution with mean 0 and equal variance σ².

Example 2: Some research questions
• Is oxygen related to minute ventilation, after taking into account carbon dioxide?
• Is carbon dioxide related to minute ventilation, after taking into account oxygen?
• What is the mean minute ventilation of all nestling bank swallows whose breathing air is comprised of 15% oxygen and 5% carbon dioxide?
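Aside (not part of the original slides): a minimal sketch of how the first order model just specified could be fit in Python with statsmodels. The file name babybirds.csv and the column names Vent, O2, CO2 are assumptions for illustration; substitute however the data are actually stored.

```python
# Sketch: fit Vent = b0 + b1*O2 + b2*CO2 + error with statsmodels.
# The file name and column names below are assumed, not from the slides.
import pandas as pd
import statsmodels.formula.api as smf

birds = pd.read_csv("babybirds.csv")           # assumed data file

model = smf.ols("Vent ~ O2 + CO2", data=birds).fit()
print(model.summary())                         # coefficients, t-tests, F-test, R-sq

# Estimated mean minute ventilation when the air is 15% oxygen and 5% CO2.
new_air = pd.DataFrame({"O2": [15], "CO2": [5]})
print(model.get_prediction(new_air).summary_frame(alpha=0.05))
```

The summary_frame() call returns both a confidence interval for the mean response and a prediction interval for a new observation at 15% oxygen and 5% carbon dioxide, which addresses the last research question above.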
Example 2: Regression output

The regression equation is
Vent = 86 - 5.33 O2 + 31.1 CO2

Predictor     Coef  SE Coef      T      P
Constant      85.9    106.0   0.81  0.419
O2          -5.330    6.425  -0.83  0.408
CO2         31.103    4.789   6.50  0.000

S = 157.4   R-Sq = 26.8%   R-Sq(adj) = 25.6%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        2  1061819  530909  21.44  0.000
Residual Error  117  2897566   24766
Total           119  3959385

Source  DF   Seq SS
O2       1    17045
CO2      1  1044773

Example 3: Is a baby's birth weight related to smoking during pregnancy?
• Sample of n = 32 births
• Response (y): birth weight of the baby in grams
• Potential predictor (x1): smoking status of the mother (yes or no)
• Potential predictor (x2): length of gestation in weeks

Example 3: Scatter matrix plot
[Scatter matrix plot of Weight, Gest, and Smoking]

Example 3: A first order model with one binary predictor

yi = β0 + β1xi1 + β2xi2 + εi

where …
• yi is birth weight of baby i
• xi1 is length of gestation of baby i
• xi2 = 1 if the mother smokes and xi2 = 0 if not
and … the independent error terms εi follow a normal distribution with mean 0 and equal variance σ².

Example 3: Estimated first order model with one binary predictor

The regression equation is
Weight = -2390 + 143 Gest - 245 Smoking

[Plot of Weight (grams) against Gestation (weeks), showing two parallel fitted lines: one for smokers (Smoking = 1) and one for non-smokers (Smoking = 0)]

Example 3: Some research questions
• Is a baby's birth weight related to smoking during pregnancy?
• How is birth weight related to gestation, after taking into account smoking status?

Example 3: Regression output

The regression equation is
Weight = -2390 + 143 Gest - 245 Smoking

Predictor      Coef  SE Coef      T      P
Constant    -2389.6    349.2  -6.84  0.000
Gest        143.100    9.128  15.68  0.000
Smoking     -244.54    41.98  -5.83  0.000

S = 115.5   R-Sq = 89.6%   R-Sq(adj) = 88.9%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  3348720  1674360  125.45  0.000
Residual Error  29   387070    13347
Total           31  3735789

Source   DF   Seq SS
Gest      1  2895838
Smoking   1   452881

Example 4: Compare three treatments (A, B, C) for severe depression
• Random sample of n = 36 severely depressed individuals
• y = measure of treatment effectiveness
• x1 = age (in years)
• x2 = 1 if patient received treatment A and 0 if not
• x3 = 1 if patient received treatment B and 0 if not

Example 4: Compare three treatments (A, B, C) for severe depression
[Scatter plot of y against age, with points labeled by treatment A, B, or C]

Example 4: A second order model with one quantitative predictor, a three-group qualitative variable, and interactions

yi = β0 + β1xi1 + β2xi2 + β3xi3 + β12xi1xi2 + β13xi1xi3 + εi

where …
• yi is treatment effectiveness for patient i
• xi1 is age of patient i
• xi2 = 1 if treatment A and xi2 = 0 if not
• xi3 = 1 if treatment B and xi3 = 0 if not

Example 4: The estimated regression function

The regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3 - 0.703 agex2 - 0.510 agex3

[Scatter plot of y against age with three fitted lines:
 treatment A: y = 47.5 + 0.33 age
 treatment B: y = 28.9 + 0.52 age
 treatment C: y = 6.21 + 1.03 age]

Example 4: Potential research questions
• Does the effectiveness of the treatment depend on age?
• Is one treatment superior to the others for all ages?
• What is the effect of age on the effectiveness of the treatment?
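Aside (not part of the original slides): a sketch of how the Example 4 indicator variables and interaction terms could be coded by hand and fit in Python with statsmodels. The file name depression.csv and the columns y, age, treatment are assumed names; only the model form comes from the slide above.

```python
# Sketch: fit y = b0 + b1*age + b2*x2 + b3*x3 + b12*age*x2 + b13*age*x3 + error.
# File name and column names (y, age, treatment with values "A"/"B"/"C") are assumed.
import pandas as pd
import statsmodels.formula.api as smf

depression = pd.read_csv("depression.csv")     # assumed data file

depression["x2"] = (depression["treatment"] == "A").astype(int)   # 1 if treatment A
depression["x3"] = (depression["treatment"] == "B").astype(int)   # 1 if treatment B
depression["agex2"] = depression["age"] * depression["x2"]        # age-by-A interaction
depression["agex3"] = depression["age"] * depression["x3"]        # age-by-B interaction

fit = smf.ols("y ~ age + x2 + x3 + agex2 + agex3", data=depression).fit()
print(fit.params)    # should mirror the Coef column of the output that follows

# Equivalent shortcut: let the formula build indicators and interactions itself.
# Note: C() picks its own baseline level (here "A"), so the coefficients are
# parameterized differently, but the fitted lines are identical.
fit2 = smf.ols("y ~ age * C(treatment)", data=depression).fit()
```

Coding the indicators by hand mirrors the slide's parameterization, where treatment C is the baseline group absorbed into the intercept and slope for age.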
Example 4: Regression output

The regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3 - 0.703 agex2 - 0.510 agex3

Predictor      Coef  SE Coef      T      P
Constant      6.211    3.350   1.85  0.074
age         1.03339  0.07233  14.29  0.000
x2           41.304    5.085   8.12  0.000
x3           22.707    5.091   4.46  0.000
agex2       -0.7029   0.1090  -6.45  0.000
agex3       -0.5097   0.1104  -4.62  0.000

S = 3.925   R-Sq = 91.4%   R-Sq(adj) = 90.0%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       5  4932.85  986.57  64.04  0.000
Residual Error  30   462.15   15.40
Total           35  5395.00

Source  DF   Seq SS
age      1  3424.43
x2       1   803.80
x3       1     1.19
agex2    1   375.00
agex3    1   328.42

Example 5: How is the length of a bluegill fish related to its age?
• In 1981, n = 78 bluegills were randomly sampled from Lake Mary in Minnesota.
• y = length (in mm)
• x1 = age (in years)

Example 5: Scatter plot
[Scatter plot of length (mm) against age (years)]

Example 5: A second order polynomial model with one quantitative predictor

yi = β0 + β1xi + β11xi² + εi

where …
• yi is length of bluegill (fish) i (in mm)
• xi is age of bluegill (fish) i (in years)
and … the independent error terms εi follow a normal distribution with mean 0 and equal variance σ².

Example 5: Estimated regression function

length = 13.6224 + 54.0493 age - 4.71866 age²

S = 10.9061   R-Sq = 80.1%   R-Sq(adj) = 79.6%

[Regression plot of length against age with the fitted quadratic curve]

Example 5: Potential research questions
• How is the length of a bluegill fish related to its age?
• What is the length of a randomly selected five-year-old bluegill fish?

Example 5: Regression output (centered age)
Here c_age is age centered at its sample mean and c_agesq is the square of c_age; centering leaves the quadratic coefficient unchanged but reduces the correlation between the linear and quadratic terms. The new observation below corresponds to a five-year-old bluegill.

The regression equation is
length = 148 + 19.8 c_age - 4.72 c_agesq

Predictor      Coef  SE Coef       T      P
Constant    147.604    1.472  100.26  0.000
c_age        19.811    1.431   13.85  0.000
c_agesq     -4.7187   0.9440   -5.00  0.000

S = 10.91   R-Sq = 80.1%   R-Sq(adj) = 79.6%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       2  35938  17969  151.07  0.000
Residual Error  75   8921    119
Total           77  44859
...
Predicted Values for New Observations
New     Fit  SE Fit          95.0% CI          95.0% PI
1    165.90    2.77  (160.39, 171.42)  (143.49, 188.32)

Values of Predictors for New Observations
New  c_age  c_agesq
1     1.37     1.88

The good news!
• Everything you learned about the simple linear regression model extends, with at most minor modification, to the multiple linear regression model:
  – same assumptions, same model checking
  – (adjusted) R²
  – t-tests and t-intervals for a single slope
  – prediction (confidence) intervals for a (mean) response

New things we need to learn!
• The above research scenarios (models), and a few more
• The "general linear test," which helps answer many research questions
• F-tests for more than one slope
• Interactions between two or more predictor variables
• Identifying influential data points

New things we need to learn!
• Detection of correlated predictors ("multicollinearity"), using variance inflation factors, and the limitations they cause
• Selection of variables from a large set of variables for inclusion in a model ("stepwise regression" and "best subsets regression")
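As a preview of the multicollinearity topic in the last list, the sketch below shows one common way to compute variance inflation factors in Python, using the Example 1 predictors. The file name iqsize.csv and the column names Brain, Height, Weight are assumptions for illustration.

```python
# Sketch: variance inflation factors (VIFs) for the Example 1 predictors.
# File name and column names below are assumed, not taken from the slides.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

iq = pd.read_csv("iqsize.csv")                            # assumed data file

X = sm.add_constant(iq[["Brain", "Height", "Weight"]])    # design matrix with intercept
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)   # ignore the "const" row; a common rule of thumb flags VIFs above about 4
```

Large VIFs signal predictors that are highly correlated with the other predictors, which inflates the standard errors of their estimated slopes; this is taken up in detail later in the course.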