Multiple Linear Regression: Applied Predictive Analytics

ICS422 Applied Predictive Analytics [3-0-0-3]
Multiple Linear Regression
Class 19
Presented by
Dr. Selvi C
Assistant Professor
IIIT Kottayam
REGIONAL DELIVERY SERVICE
• Let's assume that you are the owner of Regional Delivery Service, Inc. (RDS), a small business that offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs; therefore, some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a delivery trip will take based on two factors: 1) the total distance of the trip in miles and 2) the number of deliveries that must be made during the trip.
RDS DATA AND VARIABLE NAMING
• To conduct your analysis you take a random sample of 10 past trips and record three pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, and 3) total travel time in hours.
• Remember that in this case, you would like to be able to predict the total travel time using both the miles traveled and the number of deliveries on each trip.
• In what way does travel time DEPEND on the first two measures?
• Travel time is the dependent variable; miles traveled and number of deliveries are independent variables.

milesTraveled (x1)   numDeliveries (x2)   travelTime (hrs) (y)
89                   4                    7
66                   1                    5.4
78                   3                    6.6
111                  6                    7.4
44                   1                    4.8
77                   3                    6.4
80                   3                    7
66                   2                    5.6
109                  5                    7.3
76                   3                    6.4
Multiple Linear Regression
• Simple linear regression: one to one, IV -> DV
• Multiple linear regression: an extension of simple linear regression; many to one, IV1 + IV2 + ... -> DV
New Consideration
• Adding more independent variables to a multiple regression
procedure does not mean the regression will be "better" or offer
better predictions; in fact, it can make things worse. This is called
OVERFITTING.
• The addition of more independent variables creates more
relationships among them. So not only are the independent variables
potentially related to the dependent variable, they are also potentially
related to each other. When this happens, it is called
MULTICOLLINEARITY.
• The ideal is for all of the independent variables to be correlated
with the dependent variable but NOT with each other.
New Consideration
• Because of multicollinearity and overfitting, there is a fair amount of
pre-work to do BEFORE conducting multiple regression analysis if
one is to do it properly.
• Correlations
• Scatter plots
• Simple regressions
Multicollinearity
• Travel time is the dependent variable and miles traveled and number
of deliveries are independent variables.
• Some independent variables, or sets of independent variables, are better at predicting the dependent variable than others; some contribute nothing.
Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + ... + βpxp + ε
• Multiple regression equation (the error term ε is assumed to have mean zero):
  E(y) = β0 + β1x1 + β2x2 + ... + βpxp
• Estimated multiple regression equation:
  ŷ = b0 + b1x1 + b2x2 + ... + bpxp
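To make the estimated equation concrete, here is a minimal sketch that fits ŷ = b0 + b1x1 + b2x2 on the RDS sample. The use of Python with pandas and statsmodels is an assumption for illustration; the slides themselves work from Excel regression output.

```python
# Minimal sketch: fit the estimated multiple regression equation
# y-hat = b0 + b1*x1 + b2*x2 on the 10-trip RDS sample.
import pandas as pd
import statsmodels.api as sm

rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "travelTime":    [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

X = sm.add_constant(rds[["milesTraveled", "numDeliveries"]])  # adds the b0 column
model = sm.OLS(rds["travelTime"], X).fit()
print(model.params)    # the estimates b0, b1, b2
print(model.summary()) # full output, analogous to Excel's SUMMARY OUTPUT
```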
Coefficient Interpretation
INTERPRETING COEFFICIENTS
• ŷ = 27 + 9x1 + 12x2
• x1 = capital investment ($1000s)
• x2 = marketing expenditures ($1000s)
• ŷ = predicted sales ($1000s)
• In multiple regression, each coefficient is interpreted as the estimated
change in y corresponding to a one unit change in a variable, when all
other variables are held constant.
• So in this example, $9000 is an estimate of the expected increase in sales
y, corresponding to a $1000 increase in capital investment (x1) when
marketing expenditures (x2) are held constant.
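For example, for a hypothetical firm with x1 = 50 ($50,000 of capital investment) and x2 = 30 ($30,000 of marketing expenditures): ŷ = 27 + 9(50) + 12(30) = 27 + 450 + 360 = 837, i.e., predicted sales of $837,000. (The input values 50 and 30 are chosen purely for illustration.)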
• Multiple regression is an extension of simple linear regression
• Two or more independent variables are used to predict / explain the variance in one
dependent variable
• Two problems may arise:
• Overfitting
• Multicollinearity
• Overfitting is caused by adding too many independent variables; they account for more
variance but add nothing to the model
• Multicollinearity happens when some/all of the independent variables are correlated
with each other
• In multiple regression, each coefficient is interpreted as the estimated change in y
corresponding to a one unit change in a variable, when all other variables are held
constant.
REGIONAL DELIVERY SERVICE
• Let's assume that you are the owner of Regional Delivery Service, Inc. (RDS), a small business that offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs; therefore, some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a delivery trip will take based on three factors: 1) the total distance of the trip in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily price of gas/petrol in U.S. dollars.
Multicollinearity
MULTIPLE REGRESSION
Conducting multiple regression analysis requires a fair amount of pre-work
before actually running the regression. Here are the steps (steps 3 and 4 are sketched in code after this list):
1. Generate a list of potential variables; independent(s) and dependent
2. Collect data on the variables
3. Check the relationships between each independent variable and the dependent variable
using scatterplots and correlations
4. Check the relationships among the independent variables using scatterplots and
correlations
5. (Optional) Conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant independent variables in the analysis to find the best fitting model
7. Use the best fitting model to make predictions about the dependent variable.
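Steps 3 and 4 can be sketched in a few lines of Python on the ten-trip sample (including the gasPrice column introduced on the next slide); pandas and matplotlib are assumed available here, while the course itself works from Excel output.

```python
# Sketch of pre-work steps 3 and 4: inspect IV-DV and IV-IV relationships
# with pairwise correlations and a scatterplot matrix.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gasPrice":      [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "travelTime":    [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

print(rds.corr().round(3))           # all pairwise correlations (steps 3 and 4)
scatter_matrix(rds, figsize=(8, 8))  # all pairwise scatterplots
plt.show()
```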
RDS DATA AND VARIABLE NAMING
To conduct your analysis you take a random sample of 10 past trips and record four pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, 3) the daily gas price, and 4) total travel time in hours.

milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)   travelTime (hrs) (y)
89                   4                    3.84            7
66                   1                    3.19            5.4
78                   3                    3.78            6.6
111                  6                    3.89            7.4
44                   1                    3.57            4.8
77                   3                    3.57            6.4
80                   3                    3.03            7
66                   2                    3.51            5.6
109                  5                    3.54            7.3
76                   3                    3.25            6.4
SCATTERPLOT SUMMARY
• Dependent variable vs independent variables
• travelTime(y) appears highly correlated with milesTraveled (x1)
• travelTime(y) appears highly correlated with numDeliveries (x2)
• travelTime(y) DOES NOT appear highly correlated with gasPrice(x3)
• Since gasPrice (x3) does NOT APPEAR CORRELATED with the dependent variable, we would NOT use that variable in the multiple regression
• Note: for now, we will keep gasPrice in and then take it out later for
learning purposes
Multicollinearity Check
• Independent variable vs independent variable
• numDeliveries (x2) APPEARS highly correlated with milesTraveled (x1); this is multicollinearity
• milesTraveled (x1) does not appear highly correlated with gasPrice (x3)
• gasPrice (x3) does not appear correlated with numDeliveries (x2)
• Since numDeliveries is HIGHLY CORRELATED with
milesTraveled, we would NOT use BOTH in the multiple regression;
they are redundant
• Note: for now, we will keep both in and then take one out later for
learning purposes
Correlations
Correlation            milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)
numDeliveries (x2)     0.956
gasPrice (x3)          0.356                0.498
travelTime (hrs) (y)   0.928                0.916                0.267
Scatterplot (DV vs IV)
[Scatterplots: travelTime vs milesTraveled (r = 0.928), travelTime vs numDeliveries (r = 0.916), travelTime vs gasPrice (r = 0.267)]
IV Scatterplots (Multicollinearity)
[Scatterplots: milesTraveled vs numDeliveries (r = 0.956), milesTraveled vs gasPrice (r = 0.356), numDeliveries vs gasPrice (r = 0.498)]
Understanding
• Correlation analysis confirms the conclusions reached by visual examination of the scatterplots.
• Redundant multicollinear variables: milesTraveled and numDeliveries are highly correlated with each other and therefore redundant; only one should be used in the multiple regression analysis.
• Non-contributing variables: gasPrice is NOT correlated with the dependent variable and should be excluded.
• For the sake of learning, we are going to break the rules and include all three independent variables in the regression at first.
• Then we will remove the problematic independent variables as we should, and watch what happens to the regression results.
• We will also perform simple regressions with the dependent variable to use as a baseline (again for the sake of learning).
Example
BP    Weight   Age   BSA    Dur    Pulse   Stress
105   85.4     47    1.75   5.1    63      33
115   94.2     49    2.1    3.8    70      14
116   95.3     49    1.98   8.2    72      10
117   94.7     50    2.01   5.8    73      99
112   89.4     51    1.89   7      72      95
121   99.5     48    2.25   9.3    71      10
121   99.8     49    2.25   2.5    69      42
110   90.9     47    1.9    6.2    66      8
110   89.2     49    1.83   7.1    69      62
114   92.7     48    2.07   5.6    64      35
114   94.4     47    2.07   5.3    74      90
115   94.1     49    1.98   5.6    71      21
114   91.6     50    2.05   10.2   68      47
106   87.1     45    1.92   5.6    67      80
125   101.3    52    2.19   10     76      98
114   94.5     46    1.98   7.4    69      95
106   87       46    1.87   3.6    62      18
113   94.5     46    1.9    4.3    70      12
110   90.5     48    1.88   9      71      99
122   95.7     56    2.09   7      75      99

Correlations:
         BP         Age        Weight     BSA        Dur        Pulse      Stress
BP       1
Age      0.659093   1
Weight   0.950068   0.407349   1
BSA      0.865879   0.378455   0.875305   1
Dur      0.292834   0.343792   0.20065    0.13054    1
Pulse    0.721413   0.618764   0.65934    0.464819   0.401514   1
Stress   0.163901   0.368224   0.034355   0.018446   0.31164    0.50631    1
Variance Inflation Factor (VIF)
• A variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model.
• Detecting multicollinearity is important because, while multicollinearity does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
• A large VIF on an independent variable indicates a highly collinear relationship to the other variables that should be considered or adjusted for in the structure of the model and selection of independent variables.
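Concretely, VIF_j = 1 / (1 - R_j²), where R_j² is the R-squared from regressing predictor j on all of the other predictors. A minimal sketch follows, assuming statsmodels is available and that bp is a pandas DataFrame holding the blood-pressure data above (the name bp is hypothetical):

```python
# Sketch: compute a VIF for each predictor. VIF_j = 1 / (1 - R_j^2), where
# R_j^2 comes from regressing predictor j on the remaining predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df, predictors):
    X = sm.add_constant(df[predictors])  # column 0 is the constant
    vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
    return pd.Series(vifs, index=predictors, name="VIF")

# Hypothetical usage on the blood-pressure data:
# print(vif_table(bp, ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]))
```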
Variance Inflation Factor (VIF)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.998073
R Square            0.99615
Adjusted R Square   0.994373
Standard Error      0.407229
Observations        20

ANOVA
             df   SS         MS         F         Significance F
Regression   6    557.8441   92.97402   560.641   6.4E-15
Residual     13   2.155858   0.165835
Total        19   560

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF        R Squared
Intercept   -12.8705       2.55665          -5.03412   0.000229   -18.3938    -7.34717
Age         0.703259       0.049606         14.17696   2.76E-09   0.596093    0.810426    1.76
Weight      0.96992        0.063108         15.36909   1.02E-09   0.833582    1.106257    8.417035   0.88119332
BSA         3.776491       1.580151         2.389956   0.032694   0.362783    7.190199    5.33
Dur         0.068383       0.048441         1.411663   0.181534   -0.03627    0.173035    1.24
Pulse       -0.08448       0.051609         -1.63702   0.125594   -0.19598    0.02701     4.41
Stress      0.005572       0.003412         1.63277    0.126491   -0.0018     0.012943    1.834845   0.45499493

The VIFs for Weight, BSA, and Pulse (8.42, 5.33, and 4.41) are fairly large. The VIF for the predictor Weight, for example, tells us that the variance of the estimated coefficient of Weight is inflated by a factor of 8.42 because Weight is highly correlated with at least one of the other predictors in the model.
Variance Inflation Factor (VIF): Weight as Dependent Variable
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.938719
R Square            0.881193
Adjusted R Square   0.838762
Standard Error      1.724594
Observations        20

ANOVA
             df   SS         MS         F         Significance F
Regression   5    308.8389   61.76777   20.7677   5.05E-06
Residual     14   41.63913   2.974223
Total        19   350.478

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   19.67444       9.464742         2.078708   0.056513   -0.62541    39.97429
Age         -0.14464       0.206491         -0.70048   0.495103   -0.58752    0.298235
BSA         21.42165       3.464586         6.183034   2.38E-05   13.99086    28.85245
Dur         0.008696       0.205134         0.042394   0.966783   -0.43127    0.448665
Pulse       0.557697       0.159853         3.488813   0.003615   0.214847    0.900548
Stress      -0.023         0.013079         -1.75834   0.10052    -0.05105    0.005054

Note the R Square of 0.881193 here: VIF(Weight) = 1 / (1 - 0.881193) ≈ 8.42, the value reported in the previous output.
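This is exactly where the 8.42 comes from; a self-contained sketch that reproduces it, assuming statsmodels is available:

```python
# Sketch: reproduce Weight's VIF by hand. Regress Weight on the other five
# predictors, then apply VIF = 1 / (1 - R^2).
import pandas as pd
import statsmodels.api as sm

# The 20-observation blood-pressure data from the slide.
bp = pd.DataFrame({
    "BP":     [105, 115, 116, 117, 112, 121, 121, 110, 110, 114,
               114, 115, 114, 106, 125, 114, 106, 113, 110, 122],
    "Weight": [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
               94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87, 94.5, 90.5, 95.7],
    "Age":    [47, 49, 49, 50, 51, 48, 49, 47, 49, 48,
               47, 49, 50, 45, 52, 46, 46, 46, 48, 56],
    "BSA":    [1.75, 2.1, 1.98, 2.01, 1.89, 2.25, 2.25, 1.9, 1.83, 2.07,
               2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.9, 1.88, 2.09],
    "Dur":    [5.1, 3.8, 8.2, 5.8, 7, 9.3, 2.5, 6.2, 7.1, 5.6,
               5.3, 5.6, 10.2, 5.6, 10, 7.4, 3.6, 4.3, 9, 7],
    "Pulse":  [63, 70, 72, 73, 72, 71, 69, 66, 69, 64,
               74, 71, 68, 67, 76, 69, 62, 70, 71, 75],
    "Stress": [33, 14, 10, 99, 95, 10, 42, 8, 62, 35,
               90, 21, 47, 80, 98, 95, 18, 12, 99, 99],
})

others = ["Age", "BSA", "Dur", "Pulse", "Stress"]
aux = sm.OLS(bp["Weight"], sm.add_constant(bp[others])).fit()
print(round(aux.rsquared, 6))            # ~0.881193, the R Square shown above
print(round(1 / (1 - aux.rsquared), 2))  # ~8.42, Weight's VIF
```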
Understanding
• We see that the predictors Weight and BSA are highly correlated (r = 0.875). We can choose to remove either predictor from the model. The decision of which one to remove is often a scientific or practical one.
• For example, if the researchers here are interested in using their final model to predict the blood pressure of future individuals, their choice should be clear. Which of the two measurements, body surface area or weight, do you think would be easier to obtain?
• If weight is an easier measurement to obtain than body surface area, then the researchers would be well-advised to remove BSA from the model and leave Weight in the model.
• Reviewing again the above pairwise correlations, we see that the predictor Pulse also appears to exhibit fairly strong marginal correlations with several of the predictors, including Age (r = 0.619), Weight (r = 0.659), and Stress (r = 0.506). Therefore, the researchers could also consider removing the predictor Pulse from the model.
After removal
(The data are the same 20 observations as before, keeping only the predictors Weight, Age, Dur, and Stress after dropping BSA and Pulse.)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.995934
R Square            0.991884
Adjusted R Square   0.989719
Standard Error      0.550462
Observations        20

Correlations:
         BP         Weight     Age        Dur       Stress
BP       1
Weight   0.950068   1
Age      0.659093   0.407349   1
Dur      0.292834   0.20065    0.343792   1
Stress   0.163901   0.034355   0.368224   0.31164   1

ANOVA
             df   SS         MS         F          Significance F
Regression   4    555.4549   138.8637   458.2834   1.76E-15
Residual     15   4.545126   0.303008
Total        19   560

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF        R squared
Intercept   -15.8698       3.195296         -4.96662   0.000169   -22.6804    -9.05922
Weight      1.034128       0.032672         31.65228   3.76E-15   0.964491    1.103766    1.234653   0.190056
Age         0.683741       0.061195         11.17309   1.14E-08   0.553306    0.814176    1.47
Dur         0.039889       0.064486         0.618561   0.545485   -0.09756    0.177339    1.2
Stress      0.002184       0.003794         0.575777   0.573304   -0.0059     0.01027     1.24
• The remaining variance inflation factors are quite satisfactory! That is, hardly any variance inflation remains. Incidentally, in terms of the adjusted R squared we did not seem to lose much by dropping the two predictors BSA and Pulse from our model: the adjusted R squared decreased only to 98.97% from the original 99.44%.
Dummy Variable
Scenario – Dummy Variable
• You are an analyst for a small company that develops house pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in another question: Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
• High school rating is not quantitative; it is qualitative (categorical). For each home the high school is either exemplary or not: yes or no.
Data with exempHS as recorded (category) and as coded (0 = Not exemplary, 1 = Exemplary):

Price1000s (y)   sqft (x1)   exempHS         exempHS (x2, coded)
145              1872        Not exemplary   0
69.9             1954        Not exemplary   0
315              4104        Exemplary       1
144.9            1524        Not exemplary   0
134.9            1297        Not exemplary   0
369              3278        Exemplary       1
95               1192        Not exemplary   0
228.9            2252        Exemplary       1
149              1620        Not exemplary   0
295              2466        Exemplary       1
388.5            3188        Exemplary       1
75               1061        Not exemplary   0
130              1195        Not exemplary   0
174              1552        Exemplary       1
334.9            2901        Exemplary       1
DUMMY VARIABLES
• In many situations we must work with categorical independent
variables
• In regression analysis we call these dummy or indicator variables
• For a variable with n categories there are always n - 1 dummy variables (a coding sketch follows below)
• Exemplary/Not exemplary: there are 2 categories, so 2 - 1 = 1 dummy variable
• North/South/East/West: there are 4 categories, so 4 - 1 = 3 dummy variables
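A sketch of this coding with pandas (an assumed tool; note that pd.get_dummies with drop_first=True drops the first category in sorted order, so the baseline it picks may differ from the slide's coding):

```python
# Sketch: a variable with n categories becomes n - 1 dummy columns
# when drop_first=True drops one category to serve as the baseline.
import pandas as pd

homes = pd.DataFrame({
    "exempHS": ["Not exemplary", "Exemplary", "Exemplary", "Not exemplary"],
    "Region":  ["South", "North", "East", "West"],
})

coded = pd.get_dummies(homes, columns=["exempHS", "Region"],
                       drop_first=True, dtype=int)
print(coded)
# exempHS (2 categories) -> 1 dummy column
# Region (4 categories)  -> 3 dummy columns; the dropped category is the baseline
```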
Region Variable Coding
Region   x1   x2   x3
North    1    0    0
South    0    1    0
East     0    0    1
West     0    0    0
Interpretation
• E(y) = β0 + β1x1 + β2x2 (regression equation)
• Expected value of home price given the high school is Not exemplary, x2 = 0:
  E(y | Not exemplary) = β0 + β1x1 + β2(0)
  E(y | Not exemplary) = β0 + β1x1
• Expected value of home price given the high school is Exemplary, x2 = 1:
  E(y | Exemplary) = β0 + β1x1 + β2(1)
  E(y | Exemplary) = (β0 + β2) + β1x1
Regression Analysis
• sqft: every square foot is related to an increase in price of 0.0621 ($1000s), or $62.10 per square foot.
• exempHS: on average, a home in an area with an exemplary high school is related to a $98,600 higher price.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.925716
R Square            0.85695
Adjusted R Square   0.833109
Standard Error      44.65395
Observations        15

ANOVA
             df   SS         MS         F          Significance F
Regression   2    143340.7   71670.36   35.94346   8.57E-06
Residual     12   23927.7    1993.975
Total        14   167268.4

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     27.07494       33.68996         0.80365    0.437229   -46.3292    100.479
sqft (x1)     0.062066       0.020324         3.05383    0.010013   0.017784    0.106348
exempHS (x2)  98.64787       35.96319         2.743023   0.017831   20.2908     177.0049
Interpretation
• E(y) = 27.1 + 0.0621x1 + 98.6x2 (estimated regression equation)
• Expected value of home price given the high school is Not exemplary, x2 = 0:
  E(y | Not exemplary) = 27.1 + 0.0621x1 + 98.6(0)
  E(y | Not exemplary) = 27.1 + 0.0621x1
• Expected value of home price given the high school is Exemplary, x2 = 1:
  E(y | Exemplary) = 27.1 + 0.0621x1 + 98.6(1)
  E(y | Exemplary) = (27.1 + 98.6) + 0.0621x1
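As a quick check, consider a hypothetical 2500-square-foot home. Not exemplary: E(y) = 27.1 + 0.0621(2500) = 27.1 + 155.25 = 182.35, about $182,000. Exemplary: E(y) = (27.1 + 98.6) + 0.0621(2500) = 280.95, about $281,000. The two prediction lines are parallel; the exemplary line sits $98,600 higher at every square footage.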
• You are an analyst for a small company that develops house pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in two questions:
1. Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
2. In what region (N, S, E, W) of the city is the home located, and how is that related to home price?
• This data is not quantitative; it is qualitative (categorical), so we will need to use dummy variables in our regression.
Data with both categorical variables and their dummy coding (exempHS: 0 = Not exemplary, 1 = Exemplary; East is the baseline region, with all three region dummies equal to 0):

Price1000s (y)   sqft (x1)   exempHS (x2)   Region   South (x3)   West (x4)   North (x5)
145              1872        0              South    1            0           0
69.9             1954        0              North    0            0           1
315              4104        1              South    1            0           0
144.9            1524        0              North    0            0           1
134.9            1297        0              East     0            0           0
369              3278        1              South    1            0           0
95               1192        0              West     0            1           0
228.9            2252        1              North    0            0           1
149              1620        0              West     0            1           0
295              2466        1              South    1            0           0
388.5            3188        1              South    1            0           0
75               1061        0              South    1            0           0
130              1195        0              East     0            0           0
174              1552        1              South    1            0           0
334.9            2901        1              South    1            0           0
For a home with 2500 square feet, the predicted price is at a premium when the high school is exemplary and lower when it is not. The predicted price is also lower for the west region and highest for the south region, and moderate for the other two.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.940779
R Square            0.885066
Adjusted R Square   0.821214
Standard Error      46.2179
Observations        15

ANOVA
             df   SS         MS            F           Significance F
Regression   5    148043.6   29608.71743   13.861149   0.000527941
Residual     9    19224.85   2136.094023
Total        14   167268.4

              Coefficients   Standard Error   t Stat         P-value      Lower 95%      Upper 95%
Intercept     51.30076       42.29369         1.212964705    0.25601839   -44.37422483   146.9757412
sqft (x1)     0.065128       0.021546         3.022764609    0.01441461   0.016387876    0.113867729
exempHS (x2)  102.1551       40.13882         2.545046236    0.03144932   11.3548324     192.9554514
South (x3)    -32.1221       44.4748          -0.72225415    0.4884789    -132.7311102   68.48688566
West (x4)     -20.8704       46.34628         -0.450315462   0.66313455   -125.7130273   83.9721305
North (x5)    -61.8466       43.87802         -1.409511619   0.19228908   -161.1055453   37.41239569
Interpretation
• E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 (estimated regression equation)
• Expected value of home price given the high school is Not exemplary, x2 = 0, and the home is in the west, x4 = 1:
  E(y | Not exemplary and west) = β0 + β1x1 + β2(0) + β3(0) + β4(1) + β5(0)
  E(y | Not exemplary and west) = β0 + β1x1 + β4
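Plugging in the estimated coefficients for a hypothetical 2500-square-foot home with a non-exemplary high school in the west: E(y) = 51.30 + 0.0651(2500) - 20.87 = 51.30 + 162.75 - 20.87 ≈ 193.2, i.e., a predicted price of about $193,000.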
• It appears from this data and its analysis:
• Higher square footage is a significant predictor of higher home price
• Being in a district with an exemplary high school is a significant predictor of
higher home price
• Region is NOT a significant predictor of higher home price.
Any Queries?
Thank you