Day 3 Slides - Grinnell College

Shonda Kuiper
Grinnell College
Statistical techniques taught in introductory statistics courses typically
have one response variable and one explanatory variable.
The response variable measures the outcome of a study.
An explanatory variable explains changes in the response variable.
[Diagram: Explanatory Variable and Response Variable]
Each variable can be classified as either categorical or quantitative.
Categorical data place individuals into one of several groups (such as
red/blue/white, male/female or yes/no).
Quantitative data consists of numerical values for which most arithmetic
operations make sense.
                                  Explanatory Variable
Response Variable    Categorical                  Quantitative
Categorical          Chi-Square test,             Logistic Regression
                     Two proportion test
Quantitative         Two-sample t-test, ANOVA     Regression
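Read as a lookup from variable types to methods (rows are the response variable, columns are the explanatory variable), the table can be sketched in Python. The function name and the lowercase type labels are illustrative assumptions, not part of the slides.

```python
# Lookup mirroring the slide: (response type, explanatory type) -> methods.
METHODS = {
    ("categorical", "categorical"): ["Chi-Square test", "Two proportion test"],
    ("categorical", "quantitative"): ["Logistic Regression"],
    ("quantitative", "categorical"): ["Two-sample t-test", "ANOVA"],
    ("quantitative", "quantitative"): ["Regression"],
}


def suggest_methods(response: str, explanatory: str) -> list[str]:
    """Return the introductory methods listed for this pair of variable types."""
    key = (response.lower(), explanatory.lower())
    if key not in METHODS:
        raise ValueError("types must be 'categorical' or 'quantitative'")
    return METHODS[key]


print(suggest_methods("quantitative", "categorical"))
```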
Statistical models have the following form:
observed value = mean response + random error
Generic Group: $\hat{\mu}_1 = \bar{Y}_1 = (70+82+90+78)/4 = 80$
Brand Name Group: $\hat{\mu}_2 = \bar{Y}_2 = (75+85+95+85)/4 = 85$
π‘Œπ‘–π‘—
=
π‘Œπ‘–
+
πœ€π‘–π‘—
π‘Œ70
11
π‘Œ80
1
πœ€-10
11
π‘Œ82
12
π‘Œ80
1
πœ€122
π‘Œ90
13
π‘Œ80
1
10
πœ€13
π‘Œ78
14
=
π‘Œ80
1
++
πœ€14
-2
π‘Œ75
21
π‘Œ85
2
πœ€-10
21
π‘Œ85
22
π‘Œ85
2
πœ€220
π‘Œ95
23
π‘Œ85
24
π‘Œ85
2
π‘Œ85
2
πœ€23
10
πœ€240
where i =1,2
j = 1,2,3,4
π‘Œ2 = πœ‡1 = 85
π‘Œ1 = πœ‡1 = 80
μ2
μ1
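The split of each observation into a group mean plus a residual can be reproduced with a few lines of Python. This is a minimal sketch using the eight battery values from the slides; the variable names are illustrative.

```python
from statistics import mean

# Battery lifetimes from the slides, one list per group.
groups = {
    "generic": [70, 82, 90, 78],
    "brand_name": [75, 85, 95, 85],
}

# Each group mean estimates the mean response for that group.
group_means = {name: mean(values) for name, values in groups.items()}

# Residual = observed value - group mean.
residuals = {
    name: [y - group_means[name] for y in values]
    for name, values in groups.items()
}

for name, values in groups.items():
    for y, e in zip(values, residuals[name]):
        print(f"{name}: {y} = {group_means[name]} + ({e})")
```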
Null Hypothesis: the two groups of batteries last the same amount
of time
𝐻0 : μ1 = μ2
πœ€1,3 = 90 − 80
π‘Œ2 = 85
= 10
π‘Œ1 = 80
μ1
ε1,3 = 90 − μ1
= ?
μ2
The theoretical model used in the two-sample t-test is designed to
account for these two group means (µ1 and µ2) and random error.
observed value = mean response + random error
$Y_{ij} = \bar{Y}_i + \hat{\varepsilon}_{ij}$ (fitted to the sample)
$Y_{ij} = \mu_i + \varepsilon_{ij}$ (theoretical model)
where $i = 1, 2$ and $j = 1, 2, 3, 4$
Null Hypothesis: 𝐻0 : μ1 = μ2
Alternative Hypothesis: 𝐻1 : μ2 ≠ μ1
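As a rough illustration of how this hypothesis is tested, the sketch below computes the pooled (equal-variance) two-sample t statistic for the battery data by hand. The equal-variance form is an assumption made here; the helper name `pooled_t` is illustrative.

```python
from math import sqrt
from statistics import mean

generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]


def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    mean_a, mean_b = mean(a), mean(b)
    ss_a = sum((y - mean_a) ** 2 for y in a)   # within-group sum of squares
    ss_b = sum((y - mean_b) ** 2 for y in b)
    df = len(a) + len(b) - 2
    pooled_var = (ss_a + ss_b) / df
    std_err = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean_b - mean_a) / std_err, df


t_stat, df = pooled_t(generic, brand_name)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")  # t = 0.857 on 6 degrees of freedom
```

A t statistic this small relative to 6 degrees of freedom gives little evidence against the null hypothesis for these eight batteries.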
ANOVA: Instead of using two group means, we break the mean response
into a grand mean, $\mu$, and two group effects ($\alpha_1$ and $\alpha_2$).
$\hat{\mu} = \bar{Y} = (70 + 82 + 90 + 78 + 75 + 85 + 95 + 85)/8 = 82.5$
$\hat{\alpha}_1 = \hat{\mu}_1 - \hat{\mu} = 80 - 82.5 = -2.5$
$\hat{\alpha}_2 = \hat{\mu}_2 - \hat{\mu} = 85 - 82.5 = 2.5$
π‘Œπ‘–,𝑗 = { π‘Œ
+
𝛼𝑖 }
+ πœ€π‘–,𝑗
70
82.5
-2.5
-10
82
82.5
-2.5
2
90
82.5
-2.5
10
78
=
82.5
+
-2.5
+
-2
75
82.5
2.5
-10
85
82.5
2.5
0
95
82.5
2.5
10
85
82.5
2.5
0
where i = 1,2
and j = 1,2,3,4
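[Figure: the two groups plotted, with the grand mean $\bar{Y} = 82.5$, the group means $\bar{Y}_1 = 80$ and $\bar{Y}_2 = 85$, and the group effects $\hat{\alpha}_1 = -2.5$ and $\hat{\alpha}_2 = 2.5$ marked]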
observed value = mean response + random error
$Y_{i,j} = \mu_i + \varepsilon_{i,j} = \{\mu + \alpha_i\} + \varepsilon_{i,j}$
where $i = 1, 2$ and $j = 1, 2, 3, 4$
$H_0: \mu_1 = \mu_2$
$H_0: \mu + \alpha_1 = \mu + \alpha_2$
$H_0: \alpha_1 = \alpha_2$
Null Hypothesis: $H_0: \alpha_1 = \alpha_2$
Alternative Hypothesis: $H_1: \alpha_1 \neq \alpha_2$
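The ANOVA decomposition can be checked numerically. The sketch below, under the usual one-way ANOVA setup, computes the grand mean, the two group effects, and the F statistic for the battery data; variable names are illustrative.

```python
from statistics import mean

groups = {"generic": [70, 82, 90, 78], "brand_name": [75, 85, 95, 85]}

all_values = [y for values in groups.values() for y in values]
grand_mean = mean(all_values)

# Group effect = group mean - grand mean.
effects = {g: mean(v) - grand_mean for g, v in groups.items()}

# Between-group and within-group sums of squares.
ss_between = sum(len(v) * effects[g] ** 2 for g, v in groups.items())
ss_within = sum((y - mean(v)) ** 2 for v in groups.values() for y in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(grand_mean, effects)
print(f"F = {f_stat:.3f}")  # F = 0.735
```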
Regression: Instead of using two group means, we create a model for a
straight line (using 𝛽0 and 𝛽1 ).
observed value = mean response + random error
$Y_{ij} = \mu_i + \varepsilon_{ij}$, where $i = 1, 2$ and $j = 1, 2, 3, 4$
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $i = 1, 2, \ldots, 8$ and $X_i$ is either 0 or 1
When $X_i = 0$: $\beta_0 + \beta_1 \cdot 0 = \beta_0 = \mu_1$
When $X_i = 1$: $\beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1 = \mu_2$
$H_0: \mu_2 - \mu_1 = 0$
$H_0: \beta_0 + \beta_1 - \beta_0 = 0$
$H_0: \beta_1 = 0$
π‘Œπ‘– = 80 + 5 ∗ 𝑋𝑖
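$\hat{\beta}_0 = \hat{\mu}_1 = 80$
$\hat{\beta}_1 = \hat{\mu}_2 - \hat{\mu}_1 = 85 - 80 = 5$
$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$, where $i = 1, 2, \ldots, 8$

$$
\begin{array}{c|rcrcrcr}
i & Y_i & = & \hat{\beta}_0 & + & \hat{\beta}_1 X_i & + & \hat{\varepsilon}_i \\ \hline
1 & 70 & = & 80 & + & 0 & + & (-10) \\
2 & 82 & = & 80 & + & 0 & + & 2 \\
3 & 90 & = & 80 & + & 0 & + & 10 \\
4 & 78 & = & 80 & + & 0 & + & (-2) \\
5 & 75 & = & 80 & + & 5 & + & (-10) \\
6 & 85 & = & 80 & + & 5 & + & 0 \\
7 & 95 & = & 80 & + & 5 & + & 10 \\
8 & 85 & = & 80 & + & 5 & + & 0
\end{array}
$$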
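The intercept and slope can be recovered with the ordinary least-squares formulas. The sketch below assumes the coding X = 0 for generic and X = 1 for brand-name batteries, matching the group order used earlier in the slides.

```python
from statistics import mean

# X = 0 for generic batteries, X = 1 for brand-name batteries.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [70, 82, 90, 78, 75, 85, 95, 85]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx           # slope: difference between the two group means
b0 = y_bar - b1 * x_bar    # intercept: mean of the X = 0 group
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(b0, b1)
print(residuals)
```

With a 0/1 predictor the least-squares line passes through both group means, so the residuals match the ones from the two-sample decomposition.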
The equation for the line is often written as:
π‘Œπ‘–
=
𝛽0
+ 𝛽1 𝑋𝑖
80
80
0
80
80
0
80
80
0
80
=
80
+
0
85
80
5
85
80
5
85
80
5
85
80
5
where i = 1,2,…,8
When there are only two groups (and we have the same
assumptions), all three models are algebraically equivalent.
π‘Œπ‘–π‘—
=
πœ‡π‘–
+
πœ€π‘–π‘—
where i =1,2
j = 1,2,3,4
𝐻0 : μ1 = μ2
π‘Œπ‘–,𝑗
= {πœ‡ + 𝛼𝑖 } + πœ€π‘–,𝑗
where i =1,2
j = 1,2,3,4
𝐻0 : 𝛼1 = 𝛼2
π‘Œπ‘–
=
𝛽0 + 𝛽1 𝑋𝑖 +
πœ€π‘–
𝐻0 : 𝛽1 = 0
where i = 1,2, …, 8
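One way to see the equivalence numerically is to compute all three test statistics from the same data. The sketch below assumes equal variances (the pooled t-test) and a 0/1 indicator for the regression; with those assumptions the regression slope's t statistic equals the two-sample t statistic, and its square equals the ANOVA F statistic.

```python
from math import isclose, sqrt
from statistics import mean

generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]
n1, n2 = len(generic), len(brand_name)
m1, m2 = mean(generic), mean(brand_name)

# All three models leave the same residuals, hence the same error mean square.
sse = sum((y - m1) ** 2 for y in generic) + sum((y - m2) ** 2 for y in brand_name)
mse = sse / (n1 + n2 - 2)

# Two-sample t-test with pooled variance.
t_two_sample = (m2 - m1) / sqrt(mse * (1 / n1 + 1 / n2))

# One-way ANOVA F statistic (two groups, so 1 numerator degree of freedom).
grand = mean(generic + brand_name)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
f_anova = (ss_between / 1) / mse

# Regression slope t statistic with a 0/1 indicator for the group.
x = [0] * n1 + [1] * n2
x_bar = mean(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = m2 - m1   # least-squares slope for a 0/1 predictor
t_slope = b1 / sqrt(mse / s_xx)

print(isclose(t_two_sample, t_slope), isclose(t_two_sample ** 2, f_anova))
```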
• Multiple regression analysis can be used to serve different goals.
The goals will influence the type of analysis that is conducted. The
most common goals of multiple regression are to:
• Describe: A model may be developed to describe the relationship
between multiple explanatory variables and the response variable.
• Predict: A regression model may be used to generalize to observations
outside the sample.
• Confirm: Theories are often developed about which variables or
combination of variables should be included in a model. Hypothesis
tests can be used to evaluate the relationship between the
explanatory variables and the response.
• Build a multiple regression model to predict retail price of cars
• Price = 35738 - 0.22 Mileage
R-Sq: 4.1%
• Slope coefficient (b1): t = -2.95 (p-value = 0.004)
Questions:
➢ What happens to Price as Mileage increases?
➢ Since b1 = -0.22 is small, can we conclude it is unimportant?
➢ Does mileage help you predict price? What does the p-value tell you?
➢ Does mileage help you predict price? What does the R-Sq value tell you?
➢ Are there outliers or influential observations?
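To explore the first two questions, it can help to evaluate the fitted line at a few mileages. The sketch below uses only the equation reported on the slide; the mileage values chosen are arbitrary illustrations.

```python
def predicted_price(mileage: float) -> float:
    """Fitted line from the slide: Price = 35738 - 0.22 * Mileage."""
    return 35738 - 0.22 * mileage


for miles in (0, 10_000, 20_000, 30_000):
    print(f"{miles:>6} miles -> predicted price {predicted_price(miles):,.0f}")

# The slope is the change per single mile; over 10,000 miles it adds up.
drop_per_10k = predicted_price(0) - predicted_price(10_000)
print(f"Predicted drop per 10,000 miles: {drop_per_10k:,.0f}")
```

The size of a slope depends on the units of the explanatory variable, so a coefficient that looks small per mile can still describe a sizeable change over a realistic range of mileage.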
What happens when all the points fall on the regression line?
What happens when the regression line does not help us estimate Y?
• R2adj includes a penalty when more terms are included in the model.
• n is the sample size and p is the number of coefficients (including
the constant term: β0, β1, β2, β3, …, βp-1).
• When many terms are in the model:
  • p is larger
  • (n - 1)/(n - p) is larger
  • R2adj is smaller
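The formula itself does not survive in the extracted slide text. Assuming the standard definition, $R^2_{adj} = 1 - \frac{n-1}{n-p}(1 - R^2)$, with p counting the constant term as above, the penalty can be sketched as follows. The sample size n = 100 is hypothetical; the slides do not state n.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared; p counts every coefficient, including the constant."""
    if n <= p:
        raise ValueError("need more observations than coefficients")
    return 1 - (n - 1) / (n - p) * (1 - r2)


# Hypothetical sample size, chosen only for illustration.
n = 100
for p in (2, 5, 8):
    # Same R-squared, more coefficients: the adjusted value shrinks.
    print(p, round(adjusted_r2(0.446, n, p), 4))
```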
• Build a multiple regression model to predict retail price of cars
[Figure: Scatterplot of Price vs Mileage, R2 = 2%]
Candidate explanatory variables: Mileage, Cylinder, Liter, Leather, Cruise, Doors, Sound
Price = 6759 + 6289Cruise + 3792Cyl - 1543Doors +
3349Leather - 787Liter - 0.17Mileage - 1994Sound
R2 = 44.6%
Step Forward Regression (Forward Selection):
Which single explanatory variable best predicts Price?
Price = 13921.9 + 9862.3Cruise
R2 = 18.56%
Price = -17.06 + 4054.2Cyl
R2 = 32.39%
Price = 24764.6 - 0.17Mileage
R2 = 2.04%
Price = 6185.8.6 + 4990.4Liter
R2 = 31.15%
Price = 23130.1 - 2631.4Sound
R2 = 1.55%
Price = 18828.8 + 3473.46Leather
R2 = 2.47%
Price = 27033.6 - 1613.2Doors
R2 = 1.93%
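The first step of forward selection amounts to ranking the one-variable models. A minimal sketch, using only the R2 values reported on the slides:

```python
# R-squared (in percent) of each one-variable model, copied from the slides.
single_r2 = {
    "Cruise": 18.56,
    "Cyl": 32.39,
    "Mileage": 2.04,
    "Liter": 31.15,
    "Sound": 1.55,
    "Leather": 2.47,
    "Doors": 1.93,
}

# Rank the candidates from highest to lowest R-squared.
ranked = sorted(single_r2, key=single_r2.get, reverse=True)
print(ranked[0])    # Cyl
print(ranked[:3])   # ['Cyl', 'Liter', 'Cruise']
```

Cyl enters the model first; the next step then asks which variable adds the most once Cyl is already included, which need not be the second-ranked variable here.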
Step Forward Regression:
Which combination of two terms best predicts Price?
(adjusted R2 in parentheses)
Price = -17.06 + 4054.2Cyl
R2 = 32.39%
Price = -1046.4 + 3392.6Cyl + 6000.4Cruise
R2 = 38.4% (38.2%)
Price = 3145.8 + 4027.6Cyl - 0.152Mileage
R2 = 34% (33.8%)
Price = 1372.4 + 2976.4Cyl + 1412.2Liter
R2 = 32.6% (32.4%)
Step Forward Regression:
Which combination of terms best predicts Price?
Price = -17.06 + 4054.2Cyl
R2 = 32.39%
Price = -1046.4 + 3393Cyl + 6000.4Cruise
R2 = 38.4% (38.2%)
Price = -2978.4 + 3276Cyl + 6362Cruise + 3139Leather
R2 = 40.4% (40.2%)
Price = 412.6 + 3233Cyl + 6492Cruise + 3162Leather
- 0.17Mileage
R2 = 42.3% (42%)
Price = 5530.3 + 3258Cyl + 6320Cruise + 2979Leather
- 0.17Mileage - 1402Doors
R2 = 43.7% (43.3%)
Price = 7323.2 + 3200Cyl + 6206Cruise + 3327Leather
- 0.17Mileage - 1463Doors - 2024Sound
R2 = 44.6% (44.15%)
Price = 6759 + 3792Cyl + 6289Cruise + 3349Leather - 787Liter
- 0.17Mileage - 1543Doors - 1994Sound
R2 = 44.6% (44.14%)
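The sequence above suggests a natural stopping point. The sketch below replays the reported steps and stops when the value in parentheses no longer increases. Two assumptions are made here: the parenthesized values are adjusted R2, and "stop when adjusted R2 stops improving" is the stopping rule (one common choice, not necessarily the one used in the slides).

```python
# (term added, R2 %, adjusted R2 %) for steps 2-7, as reported on the slides.
path = [
    ("Cruise", 38.4, 38.2),
    ("Leather", 40.4, 40.2),
    ("Mileage", 42.3, 42.0),
    ("Doors", 43.7, 43.3),
    ("Sound", 44.6, 44.15),
    ("Liter", 44.6, 44.14),
]

selected = ["Cyl"]          # step 1: best single predictor
best_adj = float("-inf")    # adjusted R2 for step 1 is not reported on the slides
for term, r2, adj in path:
    if adj <= best_adj:     # adding this term no longer improves adjusted R2
        break
    selected.append(term)
    best_adj = adj

print(selected, best_adj)
```

Under this rule Liter is left out: it does not raise R2 at the reported precision, and the adjusted value drops from 44.15% to 44.14%.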
Step Backward Regression (Backward Elimination):
Price = 6759 + 3792Cyl + 6289Cruise + 3349Leather -787Liter
-0.17Mileage -1543Doors - 1994Sound
R2 = 44.6% (44.14%)
Price = 7323.2 + 3200Cyl + 6206Cruise + 3327Leather
-0.17Mileage – 1463Doors – 2024Sound
R2 = 44.6% (44.15%)
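Backward elimination runs the same idea in reverse: start from the full model and drop terms while a chosen criterion improves. The sketch below is a generic skeleton, not the procedure of any particular statistics package. The `toy_gain` numbers are invented solely so the example runs without the car data; in practice `score` would refit the model and return something like adjusted R2.

```python
from typing import Callable, Sequence


def backward_elimination(terms: Sequence[str],
                         score: Callable[[Sequence[str]], float]) -> list[str]:
    """Drop one term at a time while doing so improves `score` (higher is better)."""
    current = list(terms)
    best = score(current)
    while len(current) > 1:
        # Score every model that omits exactly one of the current terms.
        trials = {t: score([u for u in current if u != t]) for t in current}
        drop = max(trials, key=trials.get)
        if trials[drop] <= best:
            break   # no single removal improves the criterion
        current.remove(drop)
        best = trials[drop]
    return current


# Hypothetical contributions, made up for illustration only (not from the slides).
toy_gain = {"Cyl": 0.30, "Cruise": 0.06, "Leather": 0.02, "Mileage": 0.02,
            "Doors": 0.015, "Sound": 0.01, "Liter": 0.0}


def toy_score(model: Sequence[str]) -> float:
    """Made-up criterion: summed contributions minus a small per-term penalty."""
    return sum(toy_gain[t] for t in model) - 0.001 * len(model)


print(backward_elimination(list(toy_gain), toy_score))
```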
Bidirectional stepwise procedures
Other criteria, such as the Akaike information criterion (AIC), the
Bayesian information criterion (BIC), and Mallows' Cp, are often used
to find the best model.
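For least-squares models these criteria can all be computed from the error sum of squares. The sketch below uses the common textbook forms under normally distributed errors (AIC and BIC are given up to an additive constant, which does not affect model comparison); the numbers in the usage lines are hypothetical.

```python
from math import log


def aic(sse: float, n: int, p: int) -> float:
    """AIC for a least-squares model, up to an additive constant."""
    return n * log(sse / n) + 2 * p


def bic(sse: float, n: int, p: int) -> float:
    """BIC for a least-squares model, up to an additive constant."""
    return n * log(sse / n) + p * log(n)


def mallows_cp(sse: float, n: int, p: int, full_mse: float) -> float:
    """Mallows' Cp; full_mse is the error mean square of the largest model."""
    return sse / full_mse - n + 2 * p


# Hypothetical values: n = 50 observations, full model with 4 coefficients.
n, sse_full, p_full = 50, 120.0, 4
full_mse = sse_full / (n - p_full)
print(aic(sse_full, n, p_full), bic(sse_full, n, p_full))
print(mallows_cp(sse_full, n, p_full, full_mse))  # equals p for the full model
```

Smaller AIC and BIC indicate a better trade-off between fit and complexity, and a Cp close to p suggests little bias from omitted terms.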
Best Subsets Regression:
Here we see that Liter is the second-best single predictor of price.
Important Cautions:
• Stepwise regression techniques can often ignore very important
explanatory variables. Best subsets is often preferable.
• Both best subsets and stepwise regression methods only consider
linear relationships between the response and explanatory
variables.
• Residual graphs are still essential in validating whether the model
is appropriate.
• Transformations, interactions and quadratic terms can often
improve the model.
• Whenever these iterative variable selection techniques are used,
the p-values corresponding to the significance of each individual
coefficient are not reliable.