Statistical Modeling

US Army Logistics Management College
Part 1: Regression Analysis
Estimating Relationships
Preparing to Use StatTools
Pharmex.xls
Pharmex Drug Stores
• StatTools is a part of the Decision Tools Suite.
• Open both Excel and StatTools.
• Select StatTools + Data Set Manager
• Select New
• Highlight the portion of the spreadsheet that
includes the data and select OK
Scatterplots: Graphing Relationships
Pharmex Drug Stores
Pharmex.xls
• Pharmex is a chain of drugstores that operates around the country.
• The company has collected data from 50 randomly selected
metropolitan regions. In each region it has collected data on its
promotional expenditures and sales in the region over the past
year.
• There are two variables, each of which is an index, not a dollar
amount.
• Promote: Pharmex’s promotional expenditures as a percentage
of those of the leading competitor.
• Sales: Pharmex’s sales as a percentage of those of the leading
competitor.
• The company expects that there is a positive relationship between
the two variables, so that regions with relatively more expenditures
have relatively more sales. However, it is not clear what the nature
of this relationship is.
Creating the Scatterplot
Pharmex Drug Stores
Pharmex.xls
• The tricky part is to decide which variable should be on
the horizontal axis.
• Select any data cell.
• Select StatTools + Summary Graphs + Scatterplot…
• In regression analysis, we always put the explanatory
variable on the horizontal axis and the response variable
on the vertical axis. In this example the company tends to
believe that larger promotional expenditures “cause”
larger values of sales, so select “Sales” as the Y variable
(the vertical axis).
• Select “Promote” as the X variable (the horizontal axis).
Interpretation
Pharmex Drug Stores
• The scatterplot indicates that there is a positive
relationship between Promote and Sales - the points
tend to rise from bottom left to top right - but the
relationship is not perfect.
• The correlation of 0.673 is shown automatically on the
plot. The important things to note about the correlation are
that it is positive and its magnitude is moderately large.
• Causation - we can never make definitive statements
about causation based on regression analysis.
Regression identifies only a statistical relationship, not a
causal relationship.
Simple Linear Regression
Pharmex Drug Stores
• The Pharmex scatterplot hints at a linear relationship between
Promote and Sales. We want to draw the “best fitting” straight
line through the points to quantify that linear relationship.
• Since the relationship is not perfect, not all points lie exactly
on the line. The vertical differences between the points and the
line are the residuals. They show how much the observed values
differ from the fitted values. The fitted value is the vertical
distance from the horizontal axis to the line.
• We define the “best fitting” line through the points in the
scatterplot to be the one with the smallest sum of squared
residuals. This line is called the least squares line.
• We now want to find the least squares line for the Pharmex
drugstore data, using Sales as the response variable and
Promote as the explanatory variable.
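A minimal sketch of the least squares calculation, in Python rather than StatTools. The five (Promote, Sales) pairs are invented for illustration and are not the Pharmex data; the function name `least_squares_line` is just a label chosen here.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the line that minimizes the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical index values (NOT the Pharmex data)
promote = [80, 90, 100, 110, 120]
sales = [85, 95, 100, 108, 112]

intercept, slope = least_squares_line(promote, sales)
fitted = [intercept + slope * p for p in promote]      # heights of the line
residuals = [s - f for s, f in zip(sales, fitted)]     # observed minus fitted
print(intercept, slope, residuals)
```

For a least squares line that includes an intercept, the residuals always sum to zero (up to rounding), which is a quick sanity check on any fit.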
Least Squares Line with StatTools
Pharmex Drug Stores
• Select any data cell.
• From the menu bar, select StatTools + Regression & Classification + Regression…
• Specify that “Sales” is the
response (dependent) variable.
• Specify that “Promote” is the
explanatory (independent)
variable.
• Select graph option: “Residuals
vs Fitted values”
Regression Output Table
Pharmex Drug Stores
• The “Constant” and “Promote” coefficients (cells B18:C18) imply
that the equation for the least squares line is:
Predicted Sales = 25.1264 + (0.7623 x Promote)
Least Squares Line Equation
Pharmex Drug Stores
We can interpret this equation as follows:
• The slope 0.7623 indicates that the sales index tends to increase by
about 0.76 for each unit increase in the promotional expenses index.
• The interpretation of the intercept is less important. It is literally the
predicted sales index for a region that does no promotions.
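As an illustration of this interpretation, the fitted equation can be written as a small function. The coefficients are the ones reported above; the region with a Promote index of 100 and an observed Sales index of 105 is hypothetical, and the function names are chosen here only for the sketch.

```python
def predicted_sales(promote):
    """Fitted Sales index from the reported least squares line."""
    return 25.1264 + 0.7623 * promote

def residual(observed_sales, promote):
    """Observed minus fitted; a positive value means the line under-predicts."""
    return observed_sales - predicted_sales(promote)

# Hypothetical region: Promote index 100, observed Sales index 105
fitted_value = predicted_sales(100)
gap = residual(105, 100)

# Raising Promote by one unit changes the prediction by exactly the slope
slope_check = predicted_sales(101) - predicted_sales(100)

# With no promotions the prediction is just the intercept
no_promotion = predicted_sales(0)
print(fitted_value, gap, slope_check, no_promotion)
```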
The Scatterplot
• A useful graph in almost any regression analysis is a
scatterplot of residuals (on the vertical axis) versus fitted
values.
• We typically examine the scatterplot for striking patterns. A
“good” fit not only has small residuals, but it has residuals
scattered randomly around 0 with no apparent pattern. This is
the case here.
The Scatterplot of
Residuals vs Fitted Values
Multiple Regression
Bendrix Automotive Parts Company
• The Bendrix Company manufactures various types of parts for
automobiles.
• The factory manager wants to get a better understanding of
overhead costs, including supervision, indirect labor, supplies,
payroll taxes, overtime premiums, depreciation, and a number of
miscellaneous items such as insurance, utilities, and janitorial and
maintenance expenses.
• Some of the overhead costs are “fixed” in the sense they do not vary
appreciably with the volume of work being done, whereas others are
“variable” and do vary directly with the volume of work being done. It
is not easy to draw a clear line between the fixed and variable
overhead components.
• The Bendrix manager has tracked total overhead costs for 36
months.
Explanatory Variables
Bendrix Automotive Parts Company
Bendrix.xls
• The factory manager collected data on two variables he
believes might be responsible for variations in overhead costs:
− MachHrs: the number of machine hours used during the month.
− ProdRuns: the number of separate production runs during
the month. (Bendrix manufactures parts in fairly large
batches called production runs. Between runs there is
downtime.)
• Each observation (row) corresponds to a single month.
• We need to estimate and interpret the equation for Overhead
when both explanatory variables, MachHrs and ProdRuns, are
included in the regression equation, but because these are time
series variables we should also look out for relationships
between these variables and the Month variable.
Multiple Regression with StatTools
Bendrix Automotive Parts Company
• Select StatTools
+ Regression & Classification
+ Regression…
• Check “Overhead” as the
response (dependent) variable.
• Check “MachHrs” and “ProdRuns”
as the explanatory (independent)
variables.
• Select the Graph options in the
dialog box as shown here.
Multiple Regression Output Table
Bendrix Automotive Parts Company
• The coefficients in B18:B20 indicate that the estimated regression
equation is
Predicted Overhead = 3997 + (43.54 x MachHrs) + (883.62 x ProdRuns)
Interpretation of Equation
Bendrix Automotive Parts Company
• If the number of production runs is held constant,
then the overhead cost is expected to increase by
$43.54 for each extra machine hour.
• If the number of machine hours is held constant, the
overhead is expected to increase by $883.62 for
each extra production run.
• $3997 is the fixed component of overhead.
• The slope terms involving MachHrs and ProdRuns
are the variable components of overhead.
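A short sketch of this split into fixed and variable components, using the coefficient values from the interpretation above (3997, 43.54, and 883.62). The month with 1,500 machine hours and 40 production runs is hypothetical, as is the function name.

```python
FIXED = 3997.0          # intercept: the fixed component of overhead
PER_MACH_HR = 43.54     # variable cost per machine hour
PER_PROD_RUN = 883.62   # variable cost per production run

def predicted_overhead(mach_hrs, prod_runs):
    """Predicted monthly overhead from the two-variable equation."""
    return FIXED + PER_MACH_HR * mach_hrs + PER_PROD_RUN * prod_runs

# Hypothetical month (not a row of Bendrix.xls)
total = predicted_overhead(1500, 40)
variable_part = total - FIXED

# One extra machine hour with production runs held constant
extra_hour = predicted_overhead(1501, 40) - predicted_overhead(1500, 40)
print(total, variable_part, extra_hour)
```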
Equation Comparison
Bendrix Automotive Parts Company
• It is interesting to compare the two-variable equation with the separate single-variable equations:
Predicted Overhead = 48,621 + 34.70 MachHrs
Predicted Overhead = 75,606 + 655.07 ProdRuns
Predicted Overhead = 3,997 + 43.54 MachHrs + 883.62 ProdRuns
• Note that both coefficients have increased. Also, the intercept is now lower than
either intercept in the single-variable equations. It is difficult to guess the changes
that more explanatory variables will cause, but it is likely that changes will occur.
• The reasoning for this is that when MachHrs is the only variable in the equation,
we are obviously not holding ProdRuns constant - we are ignoring it - so in effect
the coefficient 34.70 of MachHrs indicates the combined effect of MachHrs and the
omitted ProdRuns on Overhead.
• But when we include both variables, the coefficient 43.54 of MachHrs indicates
the effect of MachHrs only, holding ProdRuns constant.
• Since the coefficients have different meanings, it is not surprising that we obtain
different estimates.
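The link between the two MachHrs coefficients can be made concrete. For least squares fits on the same data, the single-variable slope equals the two-variable slope plus the ProdRuns coefficient times the slope from regressing ProdRuns on MachHrs. That auxiliary regression is not reported in these slides, so the sketch below only backs out the value it would need to have, taking the MachHrs coefficient as 43.54 and working from rounded figures.

```python
simple_slope = 34.70      # MachHrs coefficient when ProdRuns is omitted
multiple_slope = 43.54    # MachHrs coefficient when ProdRuns is included
prodruns_coef = 883.62    # ProdRuns coefficient in the two-variable equation

# simple_slope = multiple_slope + prodruns_coef * aux_slope
# so the implied slope of ProdRuns on MachHrs (an inferred value, not slide output) is:
implied_aux_slope = (simple_slope - multiple_slope) / prodruns_coef

# Rebuilding the single-variable slope from its two pieces
rebuilt = multiple_slope + prodruns_coef * implied_aux_slope
print(implied_aux_slope, rebuilt)
```

The implied slope is slightly negative, which is consistent with the single-variable coefficient being smaller: in this sample, months with more machine hours would tend to have slightly fewer production runs, and the omitted variable pulls the MachHrs coefficient down.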
Modeling Possibilities
Fifth National Bank Gender-Discrimination Suit
• The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female
employees receive substantially smaller salaries than its
male employees.
• The bank’s employee database is listed in this file. Here
is a partial list of the data.
Bank.xls
Variables
Fifth National Bank Gender-Discrimination Suit
For each of the 208 employees, the variables in the data set are:
• EducLev: education level with
categories 1 (high school grad),
2 (some college), 3 (bachelor’s
degree), 4 (some graduate
courses) & 5 (graduate degree)
• JobGrade: current job level, the
possible levels being from 1-6
(6 is highest)
• YrHired: year employee was
hired
• Salary: current annual salary in
thousands of dollars
• YrBorn: year employee was
born
• Gender: a categorical variable
with values “Female” and “Male”
• YrsPrior: number of years of
work experience at another bank
prior to working at Fifth National
• PCJob: a dummy variable with
value 1 if the employee’s current
job is computer-related and
value 0 otherwise
Do the data provide evidence that females are
discriminated against in terms of salary?
Naïve Approach
Fifth National Bank Gender-Discrimination Suit
• A naïve approach to the problem is to compare the
average salaries of the males and females.
• The average of all salaries is $39,922, the average
female salary is $37,210, and the average male salary
is $45,505.
• The difference between the averages is statistically
significant. The females are definitely earning less, but
perhaps there is a reason.
• The question is whether the difference between the
average salaries is still evident after taking other
attributes into account. This is a perfect task for regression.
Dummy Variables
Fifth National Bank Gender-Discrimination Suit
• Some potential explanatory variables are categorical and
cannot be measured on a quantitative scale. However, we
often need to use these variables because they are related to
the response variable.
• The trick is to create dummy variables, also called indicator or
0-1 variables, that indicate the category a given observation is
in.
• To create dummy variables we can use an IF statement or we
can use StatTools’ Dummy variable procedure, which is
usually easier particularly when there are multiple categories.
• Once the dummy variables are created, we can combine the
variables if we like by simply adding the columns to get the
dummy for the new category.
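A minimal sketch of both steps, creating one 0-1 column per category and then combining two categories by adding their columns. It is written in plain Python rather than with an Excel IF formula or the StatTools procedure; the six job grades and the helper name `make_dummies` are made up for illustration.

```python
def make_dummies(values):
    """Return {category: 0-1 list}, one dummy column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical job grades for six employees (not the bank's data)
job_grade = [1, 3, 2, 6, 3, 1]
dummies = make_dummies(job_grade)

# Combine grades 1 and 2 into a single "low grade" dummy by adding the columns.
# Each employee is in exactly one grade, so the sum is still 0 or 1.
low_grade = [a + b for a, b in zip(dummies[1], dummies[2])]
print(dummies[3], low_grade)
```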
Regression Analysis w/Dummy Variables
Fifth National Bank Gender-Discrimination Suit
• In this example we create dummy variables for Gender
and JobGrade. We also create another variable:
YrsExper = 95 – YrHired (since this is 1995 data)
• We must follow two rules:
− We shouldn’t use any of the original categorical
variables that the dummies are based on.
− We should use one fewer dummy than the number of
categories for any categorical variable.
• Then we can run a regression analysis with Salary as
the response variable, using any combination of
numerical and dummy explanatory variables.
Creating Dummy Variables
Gender Categorical Variable
To create a dummy variable called Female for Gender:
• Select any data cell.
• From the menu bar, select StatTools + Data Utilities
+ Dummy…
• Select “Gender” as the variable.
• Select “Create One Dummy Variable for Each Distinct
Category”.
• Answer “Yes” to warnings.
Repeat the procedure for JobGrade.
Regression Analysis
Gender Only
• We first estimate a regression equation with Female as the only
variable. The resulting equation is:
Predicted Salary = 45.505 - 8.296Female
• To interpret this equation recall that Female has only two
possible values, 0 and 1. If we substitute 1 then the predicted
salary equals 37.209 and if we substitute 0 the predicted
salary is 45.505.
• These are the average salaries of females and males.
Therefore the interpretation of the -8.2955 coefficient of the
Female dummy variable is straightforward: it is the difference
between the two averages.
• The above equation tells only part of the story; it ignores all
information except for gender.
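The claim that a regression on a single dummy reproduces the two group averages can be checked on a toy example. The seven salaries below are invented (they are not the bank's data), and the fitting function is a plain least squares sketch rather than StatTools output.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical salaries in $1000s: three females, four males
female = [1, 1, 1, 0, 0, 0, 0]
salary = [30, 35, 40, 40, 45, 50, 45]

intercept, slope = least_squares_line(female, salary)

female_avg = sum(s for f, s in zip(female, salary) if f == 1) / 3
male_avg = sum(s for f, s in zip(female, salary) if f == 0) / 4
print(intercept, slope, female_avg, male_avg)
```

The intercept matches the male average (the category coded 0) and the slope matches the female average minus the male average.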
Regression Analysis
Gender + YrsExper + YrsPrior
• We expand this equation by adding YrsExper and YrsPrior.
• The corresponding equation is:
Pred Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female
• It is useful to write two separate equations, one for females:
Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior
and one for males:
Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior
• We interpret the coefficient -8.080 of the Female dummy
variable as the average salary disadvantage for females relative
to males after controlling for job experience. But there is still
more story to tell.
Regression Analysis
Gender + YrsExper + YrsPrior + JobGrade
• We next add job grade to the equation by including five of the
six job grade dummies. Although any five can be used, we use
Job_2 - Job_6.
• The estimated regression equation is now:
Predicted Salary = 30.230 + 0.408YrsExper
+ 0.149YrsPrior - 1.962Female + 2.575Job_2
+ 6.295Job_3 + 10.475Job_4 +16.011Job_5 + 27.647Job_6
• There are now two categorical variables involved, gender and
job grade. However, we can still write a separate equation for
any combination of categories by setting the dummies to the
appropriate values.
Interpretation
Gender + YrsExper + YrsPrior + JobGrade
• The equation for females at the fifth job grade is found by
setting Female=1, Job_5=1, & other job dummies equal to 0.
Predicted Salary = 44.279 + 0.408YrsExper + 0.149YrsPrior
• The expected salary increase for one extra year of experience
is $408; the expected salary increase for one year experience
with another bank is $149 (either gender and any job grade).
• The coefficients of the job dummies indicate the average
increase in salary an employee can expect relative to the
reference (lowest) job grade.
• The key coefficient, the negative $1,962 for females, indicates the
average salary disadvantage for females relative to males, given
that they have the same experience levels and are in the same
job grade.
• The “penalty” is less than a fourth of the penalty we saw
before. It appears that females might be getting paid less on
average partly because they are in the lower job categories.
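The way the dummies switch terms on and off can be sketched as a function of the reported coefficients. The coefficients are those in the equation above; the function and dictionary names are chosen here, and the employee with 10 years of experience and 2 prior years is hypothetical.

```python
# Job grade 1 is the reference category, so its dummy coefficient is 0
JOB_COEF = {1: 0.0, 2: 2.575, 3: 6.295, 4: 10.475, 5: 16.011, 6: 27.647}

def predicted_salary(yrs_exper, yrs_prior, female, job_grade):
    """Predicted salary in $1000s; female is 1 or 0, job_grade is 1-6."""
    return (30.230 + 0.408 * yrs_exper + 0.149 * yrs_prior
            - 1.962 * female + JOB_COEF[job_grade])

# Intercept of the equation for females at the fifth job grade
female_grade5_intercept = predicted_salary(0, 0, 1, 5)

# Same experience and same grade, differing only in gender
gender_gap = predicted_salary(10, 2, 1, 3) - predicted_salary(10, 2, 0, 3)

# Ratio of this gap to the -8.080 coefficient from the equation without job grade
share_of_earlier_gap = -1.962 / -8.080
print(female_grade5_intercept, gender_gap, share_of_earlier_gap)
```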
Pivot Table
Concentration of Females in Lower Paid Jobs
• We can use a pivot table to check whether females are
disproportionately in the lower job categories (set JobGrade in
the row area, Gender in the column area and the count
(expressed as a percentage) of any variable in the data area).
• Clearly, females tend to be concentrated at the lower job
grades.
• This helps explain why females get lower salaries on average,
but doesn’t explain why females are at the lower job grades in
the first place.
• We won’t be able to
provide a thorough
analysis of this issue.
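A rough equivalent of this pivot table in plain Python, for readers without Excel at hand. The eight (JobGrade, Gender) records are invented, and expressing each count as a percentage of its gender column is an assumption about how the percentages are defined on the slide.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) records - NOT the bank's data
records = [(1, "Female"), (1, "Female"), (2, "Female"), (2, "Female"),
           (1, "Male"), (5, "Male"), (6, "Male"), (6, "Male")]

cell_counts = Counter(records)                    # count per (grade, gender) cell
gender_totals = Counter(g for _, g in records)    # column totals

# Percentage of each gender that falls in each job grade
pct_of_gender = {
    (grade, gender): 100 * count / gender_totals[gender]
    for (grade, gender), count in cell_counts.items()
}

# Print one row per job grade: [Female %, Male %]
for grade in sorted({g for g, _ in records}):
    row = [pct_of_gender.get((grade, gender), 0.0) for gender in ("Female", "Male")]
    print(grade, row)
```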
Conclusion
• The main conclusion we can draw from the output
is that there is still a plausible case to be made for
discrimination against females, even after including
information on all the variables in the database in
the regression equation.
Interaction Terms
Fifth National Bank Gender-Discrimination Suit
• An interaction variable algebraically is the product of two variables.
Its effect is to allow the effect of one of the variables on Y to depend
on the value of the other variable.
• The interaction term allows the slope of the regression line to differ
between the two categories.
• Earlier we estimated an equation for Salary using the numerical
explanatory variables YrsExper and YrsPrior and the dummy
variable Female.
• If we drop the YrsPrior variable from the equation (for simplicity) and
rerun the regression, we obtain the equation
Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female
• The R² value for this equation is 49.1%. If we decide to include an
interaction variable between YrsExper and Female in this equation,
what is the effect?
Solution with Interaction Terms
Fifth National Bank Gender-Discrimination Suit
• We first need to form an interaction variable that is the
product of YrsExper and Female.
• This can be done two ways in Excel:
− Do it manually by introducing a new variable that contains the
product of the two variables involved, or
− Use StatTools + Data Utilities + Interaction…
• With the latter approach, select Female and YrsExper as
the variables.
• Once the interaction variable has been created, we include it
in the regression equation in addition to the other variables.
Interpretation w/ Interaction Terms
Fifth National Bank Gender-Discrimination Suit
• The estimated regression equation is
Predicted Salary = 30.430 + 1.528YrsExper
+ 4.098Female - 1.248YrsExper_Female
• The female equation is: Pred Salary = 34.528 + 0.280YrsExper
& the male equation is: Pred Salary = 30.430 + 1.528YrsExper
• Graphically - Nonparallel Female and Male Salary Lines
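The two separate lines follow from the single interaction equation by setting Female to 1 or 0. The sketch below uses the reported coefficients; the function name is chosen here, and the crossover point is a derived quantity that does not appear in the slides.

```python
def predicted_salary(yrs_exper, female):
    """Predicted salary in $1000s from the equation with the interaction term."""
    interaction = yrs_exper * female          # the YrsExper_Female variable
    return 30.430 + 1.528 * yrs_exper + 4.098 * female - 1.248 * interaction

# Intercept and slope of each line, recovered from the single equation
female_intercept = predicted_salary(0, 1)
female_slope = predicted_salary(1, 1) - predicted_salary(0, 1)
male_intercept = predicted_salary(0, 0)
male_slope = predicted_salary(1, 0) - predicted_salary(0, 0)

# Years of experience at which the two nonparallel lines cross:
# 4.098 - 1.248 * x = 0
crossover_years = 4.098 / 1.248
print(female_intercept, female_slope, male_intercept, male_slope, crossover_years)
```

Under these estimates the predicted female salary is above the male one only for a little over three years of experience; beyond that the steeper male line is higher.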
Conclusion w/Interaction Terms
Fifth National Bank Gender-Discrimination Suit
• The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to
start out slightly higher than males - but the slope of the
female line is much lower. That is, males tend to move up
the salary ladder much more quickly than females.
• Again, this provides another argument, although a
somewhat different one, for gender discrimination against
females.
• The R² value increased from 49.1% to 63.9%. The
interaction variable has definitely added to the
explanatory power of the equation.
Part 2: Regression Analysis
Statistical Inference
Inference About Regression Coefficients
Bendrix Automotive Parts Company
Bendrix1.xls
• As before, the response variable is Overhead and the explanatory
variables are MachHrs and ProdRuns.
• What inferences can we make about the regression coefficients?
• We obtain the output using StatTools.
Multiple Regression Output
Bendrix Automotive Parts Company
Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns
• Regression coefficients estimate the true, but
unobservable, population coefficients.
• The standard error of b_i indicates the accuracy of these
point estimates.
• For example, the estimated effect on Overhead of a one-unit
increase in MachHrs is 43.536. We are 95% confident
that the true coefficient is between 36.234 and 50.839. Similar
statements can be made for the coefficient of ProdRuns
and the intercept term.
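A quick arithmetic check on this interval, as a sketch. A confidence interval for a coefficient is the point estimate plus or minus a t multiplier times the standard error, so the reported limits imply both the estimate and an approximate standard error. The t multiplier of 2.0345 is an assumption taken from a t table for 33 degrees of freedom (36 months minus 3 estimated coefficients); the standard error itself is not quoted in these slides.

```python
# 95% confidence limits for the MachHrs coefficient, as reported above
lower, upper = 36.234, 50.839

point_estimate = (lower + upper) / 2   # the interval is centered on the estimate
half_width = (upper - lower) / 2

# ASSUMPTION: t multiplier for 95% confidence with 36 - 3 = 33 degrees of freedom
t_multiplier = 2.0345
implied_std_error = half_width / t_multiplier

print(point_estimate, half_width, implied_std_error)
```

The midpoint agrees with the reported coefficient of 43.536 to within rounding, and the implied standard error is roughly 3.6.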
A Test for the Overall Fit:
The ANOVA Table
Bendrix Automotive Parts Company
• Does the ANOVA table for the Bendrix manufacturing data
indicate that the combination MachHrs and ProdRuns has at
least some ability to explain variation in Overhead?
• The F-ratio is “off the charts” and the p-value is practically 0.
Statistical Modeling - 36
US Army Logistics Management College
Interpretation of the ANOVA Table
Bendrix Automotive Parts Company
• This information wouldn’t be much comfort for the
Bendrix manager who is trying to understand the causes
of variation in overhead costs.
• This manager already knows that machine hours and
production runs are related positively to overhead costs - everyone in the company knows that!
• What he really wants is a set of explanatory
variables that yields a high R² and a low standard error of estimate, s_e.
• The low p-value in the ANOVA table does not guarantee
these. All it guarantees is that MachHrs and ProdRuns
are of “some help” in explaining variation in Overhead.
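To show how the F-ratio, R², and s_e all come from the same ANOVA sums of squares, here is a small sketch. The sums of squares below are invented round numbers, not the Bendrix output; only n = 36 months and k = 2 explanatory variables come from the example.

```python
def anova_summary(ss_regression, ss_residual, n, k):
    """Return (R-squared, standard error of estimate, F-ratio) for k explanatory variables."""
    ss_total = ss_regression + ss_residual
    df_residual = n - k - 1
    mean_sq_regression = ss_regression / k
    mean_sq_residual = ss_residual / df_residual
    r_squared = ss_regression / ss_total
    std_err_estimate = mean_sq_residual ** 0.5
    f_ratio = mean_sq_regression / mean_sq_residual
    return r_squared, std_err_estimate, f_ratio

# Hypothetical sums of squares (NOT the Bendrix ANOVA table)
r_squared, std_err_estimate, f_ratio = anova_summary(
    ss_regression=990.0, ss_residual=330.0, n=36, k=2)
print(r_squared, std_err_estimate, f_ratio)
```

In this made-up case the F-ratio is large even though a quarter of the variation is unexplained, which is the point of the slide: a significant F test says the variables help, not that the fit is tight.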
Violations of Regression Assumptions
Bendrix Automotive Parts Company
• Is there evidence of nonconstant variance?
• Is there any evidence of lag 1 autocorrelation
in the Bendrix data when Overhead is
regressed on MachHrs and ProdRuns?
• Is there evidence of non-normality?
Do the Residuals Have
Constant Variance?
Bendrix Automotive Parts Company
• If the residual variance is
not constant, the
standard error of the
regression coefficient,
s(b_i), is incorrect.
• Note: when we ran the
regression we selected
“Residuals vs Fitted
Values” graphs.
Plot of Residuals vs Fitted Values
Bendrix Automotive Parts Company
• Residuals appear to have equal variances (homoscedasticity).
[Figure: Scatterplot of Residual vs Fit. Vertical axis: Residual (-12,000 to 8,000); horizontal axis: Fit (75,000 to 120,000).]
Autocorrelated Residuals
Bendrix Automotive Parts Company
• The residuals of time series data are often autocorrelated.
The most frequent type of autocorrelation is positive
autocorrelation. For example, if residuals separated by 1
month are correlated, this is called lag 1 autocorrelation.
• We use the fitted values (col C) and residuals (col D) in the
“Regression” tab. The residuals show how much the
regression over-predicts (if negative) or under-predicts (if
positive) the overhead cost for that month.
Durbin-Watson Test
Bendrix Automotive Parts Company
• We can check for lag 1 autocorrelation in two ways: with the Durbin-Watson (DW) statistic and by examining the time series graph of the
residuals.
• The Durbin-Watson (DW) statistic is scaled between 0 and 4:
− Near 2: little lag 1 autocorrelation
− Less than 2: positive autocorrelation
− Greater than 2: negative autocorrelation
− Rule of thumb: with n = 30 and 1 to 5 explanatory variables, a DW value below 1.2 is a problem.
• We calculate the DW statistic in cell E45 with the formula:
=StatDurbinWatson(D45:D80)
• Based on these guidelines, the DW value of 1.3131 suggests positive
autocorrelation - it is less than 2 - but not enough to cause concern.
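For reference, the standard textbook definition of the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements that definition; it is assumed, not confirmed here, that StatDurbinWatson computes the same quantity. The short residual series are made up to show the two extremes.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residual series (not the Bendrix residuals)
runs = [1, 1, -1, -1]            # positives follow positives: positive autocorrelation
alternating = [1, -1, 1, -1]     # signs flip every period: negative autocorrelation

dw_runs = durbin_watson(runs)
dw_alternating = durbin_watson(alternating)
print(dw_runs, dw_alternating)
```

The series with runs gives a value below 2 and the alternating series gives a value above 2, matching the guidelines above.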
Time Series Graph of Residuals
Bendrix Automotive Parts Company
• This general conclusion is supported by the time series graph.
[Figure: Time Series of Residuals / Data Set #2. Vertical axis: residual (-12,000 to 8,000); horizontal axis: Observation #.]
To create the graph:
− Add the range A44:D80 as a data set
− Select StatTools + Time Series & Forecasting + Time Series Graph
− Select Residuals as the variable
• Serious positive lag 1 autocorrelation would tend to show long runs of residuals
on one side of the horizontal axis, then on the other - positives would tend to
follow positives and negatives would tend to follow negatives. There is some
indication of this in the graph but not an excessive amount.
Are the Residuals
Normally Distributed?
Bendrix Automotive Parts Company
• The inferences we want to make assume the residuals
are normally distributed.
• Using Data Set #2
• Select: StatTools + Normality Tests
+ Q-Q Normal Plot
• Select “Residuals” as the variable
• Check “Plot Using Standardized Q-Values”
and “Include Reference Line”
Normal Probability Plot
Bendrix Automotive Parts Company
• Error terms appear to be normally distributed.
[Figure: Q-Q Normal Plot of Residuals / Data Set #2. Vertical axis: Standardized Q-Value (-3.5 to 3.5); horizontal axis: Z-Value (-3.5 to 3.5).]
Multicollinearity
Height vs Left & Right Feet
• The relationship between the explanatory variable X and the
response variable Y is not always accurately reflected in the
coefficient of X; it depends on which other X’s are included or not
included in the equation (especially when there is a linear relationship
between two or more explanatory variables, in which case we have
multicollinearity).
• Multicollinearity is the presence of a fairly strong linear relationship
between two or more explanatory variables, and it can make
estimation difficult.
• We want to explain a person’s height by means of foot length. The
response variable is Height, and the explanatory variables are Right
and Left, the length of the right foot and the left foot, respectively.
• It is likely that there is a large correlation between height and foot
size, so we would expect this regression equation to do a good job.
The R² value will probably be large. But what about the coefficients of
Right and Left?
Correlation of Left & Right
Height vs Left & Right Feet
Height.xls
• To show what can happen numerically, we generated a
hypothetical data set of heights and left and right foot lengths in
this file.
• We did this so that, except for random error, height is
approximately 32 plus 3.2 times foot length (in inches).
StatTools + Summary Statistics + Correlation & Covariance
The correlations between Height and either Right or Left in our data
set are quite large, and the correlation between Right and Left is very
close to 1.
Multiple Regression
Height vs Left & Right Feet
• The Regression output tells a somewhat confusing story.
• The multiple R and the corresponding R² are about what we
would expect, given the correlations between Height and either
Right or Left.
• In particular, the multiple R is close to the correlation between
Height and either Right or Left. Also, the s_e value is quite good.
It implies that predictions of height from this regression
equation will typically be off by only about 2 inches.
• However, the coefficients of Right and Left are not at all what we
might expect, given that we generated heights as
approximately 32 plus 3.2 times foot length.
• In fact, the coefficient of Left has the wrong sign - it is negative!
• Besides this wrong sign, the tip-off that there is a problem is
that the t-value of Left is quite small and the corresponding p-value is quite large.
Solution
• Judging by this, we might conclude that Height and
Left are either not related or are related negatively.
But we know from the table of correlations that both
of these are false.
• In contrast, the coefficient of Right has the “correct”
sign, and its t-value and associated p-value do imply
statistical significance, at least at the 5% level.
• However, this happened mostly by chance; slight
changes in the data could change the results
completely.
Solution
• Although both Right and Left are clearly related to Height, it is
impossible for the least squares method to distinguish their separate
effects.
• Note that the sum of the coefficients is 3.178 which is close to the
coefficient of 3.2 we used to generate the data. Therefore, the
estimated equation will work well for predicting heights, but does not
provide reliable estimates of the coefficients of Right and Left.
• When Right is the only variable: Predicted Height = 31.546 + 3.195Right
• The R² = 81.6%, s_e = 2.005, the t-value = 21.34, and the p-value = 0.000
for the coefficient of Right - very significant.
• When Left is the only variable: Predicted Height = 31.526 + 3.197Left
• The R² = 81.1%, s_e = 2.033, the t-value = 20.99, and the p-value =
0.0000 for the coefficient of Left - again very significant.
• Clearly, both of these equations tell almost identical stories, and they
are much easier to interpret than the equation with both Right and Left
included.
Stepwise Regression
HyTex Catalogs
• HyTex is a direct marketer of stereo equipment, personal
computers, and other electronic products. HyTex
advertises entirely by mailing catalogs to its customers,
and all of its orders are taken over the telephone.
• The company spends a great deal of money on its
catalog mailings, and it wants to be sure that this is
paying off in sales. Data on 250 customers who
purchased mail-order products from the HyTex Company
in 1998 are available.
• Stepwise regression will be used to produce a
regression equation for the amount spent in 1998.
The Data
HyTex Catalogs
For each customer there are data on the following variables:
• Age: (1 = 30 or younger, 2 = 31
to 55, 3 = 56 and older)
• Salary: combined annual salary of
customer and spouse (if any)
• Gender: (1 = male, 0 = female)
• Children: number of children
living with customer
• OwnHome: (1 = customer
owns home, 0 otherwise)
• Customer97: (1 = customer
purchased from HyTex during
1997, 0 otherwise)
• Married: (1 = customer is
currently married, 0 otherwise)
• Close: (1 = customer lives
reasonably close to shopping
area that sells similar
merchandise, 2 otherwise)
• Spent97: total amount of
purchase in 1997 from HyTex
• Catalogs: Number of catalogs
sent to the customer in 1998
• Spent98: total amount of
purchase in 1998 from HyTex
Stepwise Regression
• Many statistical packages provide some assistance by including
automatic equation-building options. These options estimate a series
of regression equations by successively adding (or deleting) variables
according to prescribed rules.
• Generically, these methods are referred to as stepwise regression.
• There are three types: forward, backward, and stepwise.
− Forward - begins with no explanatory variables in the equation
and successively adds one at a time until no explanatory variables
make a significant contribution.
− Backward - begins with all potential explanatory variables in the
equation and deletes them one at a time until further deletion
would do more harm than good.
− Stepwise - much like a forward procedure, except that it also
considers possible deletions along the way.
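A bare-bones sketch of the forward procedure only, to make the idea concrete. It is not the StatTools algorithm: as a stand-in for the p-value or F-value criteria, it adds the variable that lowers the sum of squared residuals (SSE) the most and stops when the relative drop falls below a threshold, `min_drop`, which is a parameter invented for this sketch. The eight-row data set is also made up, constructed so that y depends on x1 and x2 but not on x3.

```python
def sse_of_fit(columns, y):
    """Fit y on an intercept plus the given columns by least squares; return the SSE."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def forward_selection(candidates, y, min_drop=0.05):
    """Add the variable that lowers SSE most until the relative drop is below min_drop."""
    chosen = []
    current = sse_of_fit([], y)
    while len(chosen) < len(candidates):
        remaining = [name for name in candidates if name not in chosen]
        trial = {name: sse_of_fit([candidates[c] for c in chosen + [name]], y)
                 for name in remaining}
        best = min(trial, key=trial.get)
        if current == 0 or (current - trial[best]) / current < min_drop:
            break
        chosen.append(best)
        current = trial[best]
    return chosen

# Made-up data: y = 10 + 5*x1 + 2*x2 plus a small disturbance unrelated to x3
candidates = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [2.5, 3.5, 7.5, 6.5, 13.5, 12.5, 16.5, 17.5]

selected = forward_selection(candidates, y)
print(selected)
```

On this data the procedure picks x1 first, then x2, and stops before adding x3, because x3 does not reduce the SSE.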
Stepwise Regression in StatTools
HyTex Catalogs
• Select StatTools + Regression
& Classification + Regression
• Select Regression Type:
Stepwise.
• Specify Spent98 as the
response variable and select all
of the other variables (besides
Customer) as potential
explanatory variables.
• Choose p-values or F-values as
the appropriate criterion.
Interpretation of Final Regression Equation
• The coefficient of Catalogs implies that, with the other variables held
constant, each additional catalog sent is associated with about $42.00 more
spent in 1998.
• The coefficient of Married implies that married customers spent about
$330.44 more, on average, than otherwise similar unmarried customers.
• The coefficient of OwnHome implies that homeowners spent about $206.28
more, on average, than otherwise similar customers who do not own their home.
• The coefficients of Spent97 and Customer97 are somewhat more difficult to
interpret. Both variables are 0 for customers who didn’t purchase in the
previous year. For those who did, the two terms combine to
-1,117.95 + 0.93Spent97.
The Partial F Test
Fifth National Bank Gender-Discrimination Suit
• The Fifth National Bank is facing a gender-discrimination suit
charging that its female employees receive substantially smaller
salaries than its male employees.
• Previously we ran several regressions for Salary to see whether
there is convincing evidence of salary discrimination against
females.
• Now, we will perform the following analysis:
• We will regress Salary versus the Gender_Female, Yrs_Exper, and
Yrs_Exper*Gender_Female_1. This will be the reduced equation.
• Then we’ll see whether the variables JobGrade_2 through JobGrade_6
add anything significant to the reduced equation.
• Next see if the variables Gender_Female_1*JobGrade_2_1 through
Gender_Female_1*JobGrade_6_1 add anything significant to what we
already have.
• Continuing on, see if EducLev_1 through EducLev_5 add anything
significant to what we already have.
First Solution
Fifth National Bank Gender-Discrimination Suit
• First, note that we created all of the dummies and interaction variables
with StatTools’ Data Utilities procedures.
• Also, note that we have used three sets of dummies, for gender, job
grade and education level. When we use these in a regression equation,
the dummy for one category of each set must be excluded; it is
the reference category. The reference categories we have used are
“male”, job grade 1 and education level 1.
• The “smallest” equation uses Gender_Female, Yrs_Exper, and
Yrs_Exper*Gender_Female_1 as explanatory variables.
• We’re off to a good start. These three variables already explain 63.9%
of the variation of Salary.
• The next equation adds the explanatory variables JobGrade_2 through
JobGrade_6.
Second Solution
Fifth National Bank Gender-Discrimination Suit
• This equation appears much better (R² increased to 81.1%). Check whether
it is significantly better with the partial F test, where R denotes the
reduced equation and C the complete equation:
$$F\text{-ratio} = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$$
• Calculate the F-ratio. Given SSE_R = 9478.232, SSE_C = 4958.368,
MSE_C = 24.916, and k – j = 8 – 3 = 5 (the number of extra variables),
the F-ratio is 36.28.
• Calculate the corresponding p-value. Using Excel, the formula is
“=FDIST(x, dof1, dof2)”, where x is the F-ratio from the partial F test (above),
dof1 is the number of additional variables (k – j), and dof2 is the residual
(unexplained) degrees of freedom of the complete equation.
• Since FDIST(36.28,5,199) is 0 to the displayed precision, there is no doubt
that the added variables contribute to the explanatory power of the equation.
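As a quick check of the arithmetic, the partial F ratio can be computed directly from the quoted sums of squares. The function and variable names below are illustrative only. The sample size of 208 comes from the outlier discussion later in this example, and the residual degrees of freedom are taken as n - k - 1.

```python
def partial_f_ratio(sse_reduced, sse_complete, mse_complete, extra_vars):
    """Partial F ratio: drop in SSE per added variable, relative to MSE_C."""
    return ((sse_reduced - sse_complete) / extra_vars) / mse_complete


n, k, j = 208, 8, 3                 # employees, variables in complete / reduced equation
sse_r, sse_c = 9478.232, 4958.368   # values quoted on the slide
dof2 = n - k - 1                    # residual df of the complete equation
mse_c = sse_c / dof2                # should reproduce the quoted MSE_C of 24.916

f_second = partial_f_ratio(sse_r, sse_c, mse_c, k - j)
print(dof2, round(mse_c, 3), round(f_second, 2))  # 199 24.916 36.28
```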
Third Solution
Fifth National Bank Gender-Discrimination Suit
• This equation appears better (R² increased to 84%). Check whether it is
significantly better with the partial F test, where R denotes the reduced
equation and C the complete equation:
$$F\text{-ratio} = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$$
• Calculate the F-ratio. Given SSE_R = 4958.368, SSE_C = 4206.345,
MSE_C = 21.682, and k – j = 13 – 8 = 5, the F-ratio is 6.9368.
• Calculate the corresponding p-value. Using Excel, the formula is
“=FDIST(x, dof1, dof2)”, where x is the F-ratio from the partial F test (above),
dof1 is the number of additional variables (k – j), and dof2 is the residual
(unexplained) degrees of freedom of the complete equation.
• Since FDIST(6.9368,5,194) is 0 to the displayed precision, there is no doubt
that the added variables contribute to the explanatory power of the equation.
Fourth Solution
Fifth National Bank Gender-Discrimination Suit
• This equation seems very slightly better (R² increased to 84.7%). Check
whether it is significantly better with the partial F test, where R denotes
the reduced equation and C the complete equation:
$$F\text{-ratio} = \frac{(SSE_R - SSE_C)/(k - j)}{MSE_C}$$
• Calculate the F-ratio. Given SSE_R = 4206.345, SSE_C = 4005.418,
MSE_C = 21.081, and k – j = 17 – 13 = 4, the F-ratio is 2.383.
• Calculate the corresponding p-value. Using Excel, the formula is
“=FDIST(x, dof1, dof2)”, where x is the F-ratio from the partial F test (above),
dof1 is the number of additional variables (k – j), and dof2 is the residual
(unexplained) degrees of freedom of the complete equation.
• Since FDIST(2.383,4,190) = 0.0529, we cannot be 95% confident that the added
variables contribute to the explanatory power of the equation. We therefore
choose not to include them in the model.
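Outside Excel, the same right-tail probability can be sketched with only the Python standard library. The sketch below assumes the usual identity P(F > x) = I_z(dof2/2, dof1/2) with z = dof2 / (dof2 + dof1·x), where I is the regularized incomplete beta function, evaluated here with a textbook continued fraction. The function names are illustrative; this is not StatTools or Excel code.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for numerator in (
            m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0)),
        ):
            d = 1.0 + numerator * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + numerator / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_upper_tail(f_ratio, dof1, dof2):
    """Right-tail probability of the F distribution, like Excel's FDIST."""
    z = dof2 / (dof2 + dof1 * f_ratio)
    return regularized_incomplete_beta(dof2 / 2.0, dof1 / 2.0, z)


# The three partial F tests from this example: (F-ratio, dof1, dof2).
for f_ratio, dof1, dof2 in [(36.28, 5, 199), (6.9368, 5, 194), (2.383, 4, 190)]:
    print(f_ratio, dof1, dof2, f_upper_tail(f_ratio, dof1, dof2))
```

The first two p-values are tiny but not exactly zero, which is why Excel displays them as 0; only the third is large enough to matter at the 5% level.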
Solution
Fifth National Bank Gender-Discrimination Suit
• According to the partial F test, the variables added in the
fourth equation do not improve the fit enough to
qualify for statistical significance at the 5% level.
• Based on this evidence, there is not much to gain from
including the education dummies in the equation, so we
would probably elect to exclude them.
• As a result, the third solution is considered the complete
solution.
Concluding Comments
Fifth National Bank Gender-Discrimination Suit
• The partial F test is the formal test of significance for an
extra set of variables. Many users look only at the R²
and/or s_e values to check whether extra variables are
doing a “good job”.
• If the partial F test shows that a block of variables is
significant, it does not imply that each variable in this
block is significant. Some added variables can have low
t-values.
Concluding Comments
Fifth National Bank Gender-Discrimination Suit
• Producing all of these outputs and
completing the partial F Test is a lot
of work.
• StatTools includes a routine called
“Block” that simplifies the process.
• Select StatTools + Regression &
Classification + Regression
• Select Regression Type: Block.
• Choose 4 blocks and identify which
additional variables enter each block.
Concluding Comments
• While we have concentrated on the partial F test and
statistical significance in this example, don’t lose sight
of the bigger picture. Once we have decided on a “final”
regression equation we need to analyze its implications
for the problem at hand.
• In this case the bank is interested in possible salary
discrimination against females, so we should interpret
this final equation in these terms. Don’t get so caught
up in the details of statistical significance that you lose
sight of the original purpose of the analysis!
Outliers
Fifth National Bank Gender-Discrimination Suit
• Are there any obvious outliers from the 208 employees?
• In what sense are they outliers?
• Does it matter to the regression results, particularly those concerning
gender discrimination, whether the outliers are removed?
• There are several places we could look for outliers. An obvious place
is the Salary variable.
• The boxplot shown here shows that there are several employees
making substantially more in salary than most of the employees.
Solution
• We could consider these outliers and remove them, arguing
perhaps that these are senior managers who shouldn’t be included
in the discrimination analysis. We leave it to you to check whether
the regression results are any different with these high salary
employees than without them.
• Another place to look is at the scatterplot of the residuals versus the
fitted values. This type of plot shows points with abnormally large
residuals. For example, we ran the regression with Female,
YrsExper, Fem_YrsExper and the five job grade dummies, and we
obtained the output and scatterplot shown here.
Solution
• This scatterplot has several points that could be
considered outliers, but we focus on the point identified
in the figure.
• The residual for this point is approximately -21.
• Given that the standard error of estimate (s_e) for this regression is
approximately 5, this residual is over four standard errors below 0 -
quite a lot.
• On inspection, this employee turns out to be unusual, and special
circumstances can explain the large residual.
• We delete this employee and rerun the regression with
the same variables.
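The "number of standard errors" check can be sketched as follows. The cutoff of 3 is a common rule of thumb assumed here, not something the slides prescribe, and every residual in the list except -21 is made up for illustration.

```python
def flag_large_residuals(residuals, std_error, cutoff=3.0):
    """Return (index, residual / std_error) for residuals beyond the cutoff."""
    flagged = []
    for index, residual in enumerate(residuals):
        standardized = residual / std_error
        if abs(standardized) > cutoff:
            flagged.append((index, standardized))
    return flagged


# Hypothetical residuals; only the -21 value is taken from the slide.
residuals = [2.1, -4.8, 0.7, -21.0, 6.3]
flagged = flag_large_residuals(residuals, std_error=5.0)
print(flagged)  # the -21 residual sits about 4.2 standard errors below 0
```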
Solution
• Recalling that gender discrimination is the key issue in this
example we compare the coefficients of Female and
Fem_YrsExper in the two outputs.
• The coefficient of Female has dropped from 6.063 to 4.353. In
words, the Y-intercept for the female regression line used to
be about $6000 higher than for the male line; now it is only
about $4350 higher.
• More importantly, the coefficient of Fem_YrsExper has
changed from -1.021 to -0.721. This indicates how much less
steep the female line for Salary versus Yrs_Exper is than the
male line.
• A change from -1.021 to -0.721 indicates less discrimination
against females now than before. This unusual female
employee accounts for a good bit of the discrimination
argument - although a strong argument still exists even
without her.
Part 3: Analysis of Variance
and Experimental Design
One-way ANOVA
• A one-way analysis of variance, or one-way ANOVA, is
the procedure for analyzing the differences between
more than two population means. A one-way ANOVA
is also used in randomized experiments where a
single population is treated in one of several ways.
• The data analysis in these two situations is identical;
only the interpretation of the results differs.
One-way ANOVA Process
• The one-way ANOVA procedure is usually run in two
stages.
• The first stage tests the null hypothesis that all of the means are equal.
• If the p-value is not sufficiently small, then there
is not enough evidence to reject the equal-means
hypothesis, and the analysis stops.
• If the p-value is sufficiently small, we can
conclude with some assurance that the means
are not all equal.
• If the means are not all equal, the second stage
determines which of the groups differ significantly from
the others.
Background Information
Effects of Shelf Height on Cereal Sales at Midway
• Midway Company selects 125 stores in its chain of
supermarkets to conduct an experiment on cereal sales.
• These stores are similar in terms of store size, customer traffic,
customer types, and other characteristics.
• Each store stocks cereal in a similar location in the store on five-shelf
displays.
• The 125 stores are randomly assigned to one of five
groups, where each group stocks Brand X cereal in a specific
shelf location (highest, next highest, middle, next lowest, lowest).
• The number of Brand X boxes sold at each store is recorded
for the last two weeks of the experiment (the first two weeks
allow customers to get used to the shelving positions).
• Objective: does shelf height make any difference in mean sales
of Brand X cereal, and if so, which shelf heights outperform the
others?
One-way ANOVA Solution
Cereal Sales at Midway
• Note that this is a designed experiment
• Initial stores chosen in an attempt to control for extraneous
factors
• Randomly assigned stores to treatment levels (shelf heights)
• The output consists of three basic parts:
• summary statistics
• the ANOVA table
• confidence intervals
• Select Statistical Inference + One Way ANOVA
• The next slide contains this output.
One-way ANOVA Solution
Cereal Sales at Midway
Results of one-way ANOVA

Summary stats for samples
                               Lowest   Next-to-lowest     Middle   Next-to-highest    Highest
Sample sizes                       25               25         25                25         25
Sample means                  334.920          378.680    383.440           426.280    383.880
Sample standard deviations     61.043           84.081     75.625            85.054     69.619
Sample variances             3726.243         7069.560   5719.173          7234.210   4846.777
Weights for pooled variance     0.200            0.200      0.200             0.200      0.200

Number of samples                   5
Total sample size                 125
Grand mean                    381.440
Pooled variance              5719.193
Pooled standard deviation      75.625

OneWay ANOVA table
Source                      SS    df          MS       F   p-value
Between variation   104807.680     4   26201.920   4.581    0.0018
Within variation    686303.120   120    5719.193
Total variation     791110.800   124

Confidence intervals for mean differences
Confidence level: 95.0%, Tukey method
Difference                          Mean diff      Lower      Upper   Signif?
Lowest - Next-to-lowest               -43.760   -103.050     15.530   No
Lowest - Middle                       -48.520   -107.810     10.770   No
Lowest - Next-to-highest              -91.360   -150.650    -32.070   Yes
Lowest - Highest                      -48.960   -108.250     10.330   No
Next-to-lowest - Middle                -4.760    -64.050     54.530   No
Next-to-lowest - Next-to-highest      -47.600   -106.890     11.690   No
Next-to-lowest - Highest               -5.200    -64.490     54.090   No
Middle - Next-to-highest              -42.840   -102.130     16.450   No
Middle - Highest                       -0.440    -59.730     58.850   No
Next-to-highest - Highest              42.400    -16.890    101.690   No
Summary Statistics
Cereal Sales at Midway
• The summary statistics show that the next to highest
shelf position has the largest mean store sales
(426.28), and the lowest shelf has the smallest mean
store sales (334.92), with the others in between.
• The sample standard deviations (or variances) vary
somewhat across the shelf positions, but not enough
to invalidate the procedure (we assume equal
variance).
• The side-by-side boxplots in the figure on the next
slide illustrate these summary measures graphically.
However, there is too much overlap to tell whether
the differences are statistically significant.
Boxplot of Mean Results by Region
Cereal Sales at Midway
[Figure: side-by-side boxplots of store sales for each shelf position (Highest, Next-to-highest, Middle, Next-to-lowest, Lowest); horizontal axis runs from 150 to 600.]
ANOVA Table Results
Cereal Sales at Midway
• The total variation in the ANOVA table is the variation
of all observations around the grand mean in the summary section,
and is used mainly to aid in calculations.
• The grand mean is the sample mean of all observations.
• The between variation is the sum of squared differences between the
treatment level means and the grand mean, weighted by the treatment
sample sizes (df = number of groups – 1).
• The within variation is the variation due to differences within individual
treatment groups (df = total sample size – number of groups).
• The F-ratio for the test is 4.581, with a corresponding p-value of
0.0018 (since this is less than 0.05, we reject the null hypothesis that all
means are equal).
• Since the means are not all equal, we proceed to a comparison test to
determine which means differ.
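The ANOVA table can be rebuilt from the summary statistics alone, which makes the definitions above concrete. The sketch below uses the sample sizes, means and variances reported in the StatTools output; the variable names are illustrative.

```python
# Summary statistics from the output, in the order
# Lowest, Next-to-lowest, Middle, Next-to-highest, Highest.
sizes = [25, 25, 25, 25, 25]
means = [334.920, 378.680, 383.440, 426.280, 383.880]
variances = [3726.243, 7069.560, 5719.173, 7234.210, 4846.777]

total_n = sum(sizes)
grand_mean = sum(n * m for n, m in zip(sizes, means)) / total_n

# Between variation: squared distances of group means from the grand mean,
# weighted by group size.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
# Within variation: each sample variance times its own degrees of freedom.
ss_within = sum((n - 1) * v for n, v in zip(sizes, variances))

df_between = len(sizes) - 1
df_within = total_n - len(sizes)
ms_between = ss_between / df_between
ms_within = ss_within / df_within  # this is the pooled variance
f_ratio = ms_between / ms_within

print(round(grand_mean, 3), round(ss_between, 2), round(ss_within, 2))
print(df_between, df_within, round(ms_between, 2), round(ms_within, 3))
print(round(f_ratio, 3))
```

Small differences from the printed table in the last decimal places come from rounding in the reported variances.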
Results
Cereal Sales at Midway
• The final section of output lists a set of pairwise comparisons of
treatment levels (shelf heights).
• The Difference column shows which two shelf heights are being compared,
and Mean diff shows the difference between the mean sales for the two
shelf heights.
• Lower and Upper give the limits of the confidence interval for the
difference between the two means. If the lower limit is negative and the
upper limit is positive, then 0 is contained in the interval and we cannot
conclude that mean sales differ between those two heights.
• The only statistically significant difference we can discern is
between the next to highest shelf and the lowest shelf (largest and
smallest mean sales).
• The company needs to decide whether that difference is practically
significant, and whether any external factors confounded the experiment.
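The intervals in the output are consistent with the standard Tukey form for equal sample sizes: difference ± q·sqrt(MS_within / n). The sketch below assumes a studentized-range critical value of about 3.92 for 5 groups and 120 degrees of freedom. That value is taken from standard tables and is not shown in the StatTools output, so treat it as an assumption.

```python
import math
from itertools import combinations

# Sample means from the output, lowest shelf first.
means = {
    "Lowest": 334.920,
    "Next-to-lowest": 378.680,
    "Middle": 383.440,
    "Next-to-highest": 426.280,
    "Highest": 383.880,
}
n_per_group = 25
ms_within = 5719.193   # pooled variance from the ANOVA table
q_critical = 3.92      # assumed table value for 5 groups, 120 df, 95% level

half_width = q_critical * math.sqrt(ms_within / n_per_group)

intervals = {}
for first, second in combinations(means, 2):
    diff = means[first] - means[second]
    lower, upper = diff - half_width, diff + half_width
    significant = lower > 0 or upper < 0   # interval excludes 0
    intervals[(first, second)] = (diff, lower, upper, significant)

significant_pairs = [pair for pair, row in intervals.items() if row[3]]
print(round(half_width, 2))
print(significant_pairs)
```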
Two-Way ANOVA
Background Information
Golf Ball Testing
• Many golf ball manufacturers claim to have the “longest ball,”
that is, the ball that goes the farthest on drives.
• This example illustrates how these claims might be tested by
testing five major brands (Brand A through E)
• A consumer testing service runs an experiment where 60 balls
of each brand are driven under three temperature conditions.
• The first 20 are driven in cool weather, the next 20 are driven in
mild weather, and the last 20 are driven in warm weather.
• The goal is to see whether some brands differ significantly, on
average, from other brands and what effect temperature has on
the mean differences between brands.
Experimental Design
Golf Ball Testing
• As in the last example, this example represents a controlled
experiment (20 golf balls of each brand are randomly assigned
to each of three temperature levels).
• In general terminology, the experimental units are the
individual golf balls and the response variable is the length (in
yards) of each drive.
• There are two factors (brand and temperature), each with
different treatment levels (brand has levels A through E, and
temperature has three levels: cool, mild, and warm).
• The design is balanced because the same number of balls, 20,
is used at each of the 5 x 3 = 15 treatment level combinations.
• There is one further piece of terminology. We call this a full
factorial two-way design because we test golf balls at each of
the 15 possible treatment level combinations.
Conducting the Experiment
Golf Ball Testing
• How should the consumer testing service carry out the
experiment?
• One possibility is to have 15 golfers, each of approximately the
same skill level, hit 20 balls each. The downside of this design
could be that the golfers assigned to a certain brand could be
having a good day.
• Golfers could be spread out (each golfer could hit 2 balls). This,
however, introduces an unwanted source of variation: the different
abilities of the golfers.
• You could use the same golfer for 300 balls. Unfortunately, the
golfer might get tired in the process of hitting this many balls.
• These are the types of issues that designers of experiments
must consider.
Conducting the Experiment
Golf Ball Testing
• The design should attempt to eliminate as many
unwanted sources of variation as possible, so that
any difference across the factor levels of interest
can be attributed to these factors and not to
extraneous factors.
• In this example, we suspect the best solution is to
employ a “mechanical” golf ball driver to hit all 300
balls.
• This should reduce the inevitable random variation
that would occur by using human golfers.
Coding the data
Golf Ball Testing
• Although many rows in the figure are
hidden, there are actually 300 rows of
data, 20 for each of the 15
combinations of Brand and Temp.
• There must be two “code” variables
that represent the levels of the two
factors and a measurement variable
that represents the response variable.
• Again this is a balanced design, which
is what StatTools expects for its two-way ANOVA procedure.
Analysis of Results
Golf Ball Testing
Prompted by the table, here are some questions we might ask:
1. Look at column I. Do any brands average significantly more yards
than any others (where these averages are averages over all
temperatures)?
2. Look at the bottom row. Do average yardages differ significantly
across temperatures (where these averages are across all brands)?
3. Look at the middle of the table. Do differences among averages of
brands depend on temperature? For example, does one brand
dominate in cool weather and another in warm weather? Also, do
differences among averages of temperatures depend on brand? For
example, are some brands very sensitive to changes in temperature
while others are not?
Analysis of Results
Golf Ball Testing
• It is useful to characterize the type of information these questions are
seeking.
• Question 1 is asking about the main effect of the brand factor. If
we ignore the temperature, do some brands tend to go farther
than some others?
• Question 2 is also asking about a main effect, the main effect of
the temperature factor. If we ignore the brand, do balls tend to go
farther in some temperatures than others? (This answer is
obvious to golfers: balls compress better and go farther in warm
temperatures.) Therefore this is not a key question, although we
would expect the study to confirm what common sense tells us.
• Question 3 is asking about interactions between the two factors.
These interactions are often the most interesting results of a
two-way study. In this example interactions are patterns of the
averages that could not be guessed by looking only at the “main
effect” averages.
Interaction Effects
Golf Ball Testing
• Specifically, the order of brands in column F, from
largest to smallest average yardages, is E, C, B, A, D.
If there were no interactions at all, this ordering would
hold at each temperature. For these data it comes close.
• At cool temperatures the ordering is C, E, B, A, D; for
mild, it is E, B, C, D, A; for warm, it is E, C, A, B, D.
• Actually, having no interaction implies even more than
the preservation of these rankings.
Interaction Effects
Golf Ball Testing
• It implies that the difference between any two brand
averages is the same at any of the three temperature
levels.
• For example, the differences between brands E and D
at the three temperature levels are:
224.8 - 215.0 = 9.8
255.7 - 237.6 = 18.1
270.9 - 256.1 = 14.8
• If there were no interactions at all, these three
differences would be equal.
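The comparison above can be reproduced in a few lines. The sketch assumes the three pairs of averages are listed in the order cool, mild, warm (the slide does not label them), and the names are illustrative.

```python
# Average yards for brands E and D; temperature labels are an assumption
# based on the order in which the slide lists the three differences.
cell_means = {
    "E": {"cool": 224.8, "mild": 255.7, "warm": 270.9},
    "D": {"cool": 215.0, "mild": 237.6, "warm": 256.1},
}
temps = ["cool", "mild", "warm"]

# Difference E - D at each temperature; equal gaps would mean no interaction.
gaps = {t: round(cell_means["E"][t] - cell_means["D"][t], 1) for t in temps}
spread = round(max(gaps.values()) - min(gaps.values()), 1)

# Main-effect view: average each brand over the three temperatures.
brand_avg = {b: sum(v.values()) / len(v) for b, v in cell_means.items()}
main_effect_gap = brand_avg["E"] - brand_avg["D"]

print(gaps)                       # {'cool': 9.8, 'mild': 18.1, 'warm': 14.8}
print(spread)                     # 8.3
print(round(main_effect_gap, 2))  # 14.23
```

The main-effect gap is simply the average of the three per-temperature gaps; the spread among those gaps is what the interaction measures.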
Interaction Graphically
Golf Ball Testing
• The concept of interaction is much easier to
understand by looking at graphs.
• The following graphs, which are both outputs from
StatTools’ two-way ANOVA procedure, represent two
ways of looking at the pattern of averages for
different combinations of brand and temperature.
• The first graph shows a line for each brand, where
each point on the line corresponds to a different
temperature. The second shows the same
information with the roles of brand and temperature
reversed.
Interaction Graphically
Golf Ball Testing
[Figure: Interaction Plot: Brand by Temp. One line per brand (A through E); horizontal axis: temperature (Cool, Mild, Warm); vertical axis: average yards, 200 to 270.]
Interaction Graphically
Golf Ball Testing
[Figure: Interaction Plot: Temp by Brand. One line per temperature (Cool, Mild, Warm); horizontal axis: brand (A through E); vertical axis: average yards, 200 to 280.]
Interaction Graphically
Golf Ball Testing
• Neither graph is better than the other; they simply show
the same data from different perspectives.
• The key to either is whether the lines are parallel. If they
are, then there are no interactions - the effect of one factor
on average yardage is the same regardless of the level of
the other factor. The more nonparallel they are, however,
the stronger the interactions are.
• The lines in either of these graphs are not exactly parallel
but they are nearly so. This implies that there is very little
interaction between brand and temperature.
Type of Interactions
• In general, interactions can be of several types.
• Shown here are two contrasting types. The graphs use
different data from those in GOLFBALLS.XLS.
• In the first graph brand A dominates at all
temperatures. However, there is an interaction
because the difference between brands
increases as temperatures increase.
Type of Interactions
• In this situation the interaction effect is interesting,
but the main effect of brand - brand A is better
when averaged over all temperatures - is also
interesting.
• The situation is quite different in the next graph,
where there is a crossover.
Type of Interactions
• Brand A is somewhat better at cool temperatures,
but brand B is better at mild and warm
temperatures.
• In this case the interaction is the most interesting
finding, and the main effect of brand is much less
interesting.
Type of Interactions
• In simple terms, if you are a golfer, you’d buy
brand A in cool temperatures and brand B
otherwise, and you wouldn’t care very much which
brand is better when averaged over all
temperatures.
• For these reasons, we check first for
interactions in a two-way design.
• If there are significant interactions, then the main effects
might not be as interesting.
• However, if there are no significant interactions, then
main effects generally become more important.
Main Effects versus Interactions
• Main effects are differences in averages across the
levels of one factor, where these averages are
averages over all levels of the other factor.
• In a table of sample means, we can check for main
effects by looking at the averages in the “Grand
Total” column and row.
• In contrast, the interactions are patterns of averages
in the main body of the table and are best shown
graphically. They indicate whether the effect of one
factor depends on the level of the other factor.
Two Way ANOVA Table
• The next question is whether the main effects and
interactions we see in the table of sample means are
statistically significant.
• As in a one-way ANOVA, this is answered by an
ANOVA table.
• However, instead of having just two sources of
variation, within and between, as in a one-way
ANOVA, there are now four sources of variation: one
for the main effect of each factor, one for interactions,
and one for the variation within treatment level
combinations.
Analysis of Results
Golf Ball Testing
•
For the golf ball data, two-way ANOVA separates the
total variation across all 300 observations into four
sources.
1. There is variation due to different brands producing different
average yardages.
2. There is variation due to different average yardages at
different temperatures.
3. There is variation due to the interactions we saw in the
interaction graphs.
4. There is the same type of “within” variation as in one-way
ANOVA. This is the variation that occurs because yardages
for the 20 balls of the same brand hit at the same temperature
are not all identical.
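For a balanced design, the four sources of variation add up exactly to the total variation. The sketch below illustrates that split on a small made-up data set (two brands, two temperatures, three balls per cell). The yardages are hypothetical and are not the GOLFBALLS.XLS data.

```python
from statistics import mean


def two_way_anova_ss(cells):
    """Sums of squares for a balanced two-way design.

    cells maps (level_of_factor_a, level_of_factor_b) -> list of observations.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    reps = len(next(iter(cells.values())))
    if any(len(obs) != reps for obs in cells.values()):
        raise ValueError("design must be balanced")

    grand = mean(y for obs in cells.values() for y in obs)
    cell_mean = {key: mean(obs) for key, obs in cells.items()}
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    return {
        # main effect of each factor
        "factor_a": reps * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "factor_b": reps * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        # interaction: what cell means show beyond the two main effects
        "interaction": reps * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels for b in b_levels),
        # variation of observations around their own cell mean
        "within": sum((y - cell_mean[key]) ** 2 for key, obs in cells.items() for y in obs),
        "total": sum((y - grand) ** 2 for obs in cells.values() for y in obs),
    }


# Hypothetical yardages: (brand, temperature) -> drives.
toy_cells = {
    ("A", "cool"): [210.0, 214.0, 212.0],
    ("A", "warm"): [250.0, 254.0, 252.0],
    ("B", "cool"): [220.0, 222.0, 218.0],
    ("B", "warm"): [270.0, 268.0, 272.0],
}
ss = two_way_anova_ss(toy_cells)
print(ss)
```

Dividing each sum of squares by its degrees of freedom and comparing it with the within mean square gives the F-ratios reported in a two-way ANOVA table.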
Output of Results
Golf Ball Testing
• Select the StatTools/
Statistical Inference/Two-way ANOVA menu item,
selecting Brand and Temp
as the “code” variables
and Yards as the
“measurement” variable
• Output includes tables
of sample sizes, sample
means, and sample
standard deviations, as
well as the ANOVA
table.
Analysis of Results
Golf Ball Testing
• We test whether main effects or interactions are statistically
significant in the usual way - by examining p-values.
• Looking first at the interactions, the p-value is about 0.03, which
says that the lines in the interaction graphs are significantly
nonparallel, at least at the 5% significance level. There is at least some
interaction between brand and temperature (although the
practical significance could be disputed).
• The two p-values for the main effects in cells G32 and G33 are
practically 0, meaning that there are differences across brands
and across temperatures.
• Of course, the main effect of temperature was a foregone conclusion
- we already know that balls do not go as far in cold temperatures -
but the main effect of brand is more interesting.
• According to the evidence, some brands definitely go farther, on
average, than some others.