Six Sigma Black Belt Training

LSSG Black Belt Training
Hypothesis Testing
Introduction
Always about a population parameter
Attempt to prove (or disprove) some assumption
Setup:
alternate hypothesis: What you wish to prove
Example: Change in Y after LSS Project
null hypothesis: Assume the opposite of what is
to be proven. The null is always stated as an
equality.
Example: Y after project is same as before
The test
1. Take a sample, compute statistic of interest.
   Example: Standardized mean customer satisfaction score
2. How likely is it that if the null were true, you
   would get such a statistic? (the p-value)
   Example: How likely is it that the sample would show (by
   random chance) the difference that we see after
   the LSS project, if in fact there was no
   improvement?
3. If very unlikely, then the null is rejected, and the
   alternate is taken as proven beyond reasonable doubt.
4. If quite likely, then the null may be true, so there is not
   enough evidence to discard it in favor of the
   alternate.
Types of Errors
                      Null is really True    Null is really False
Reject null           Type I Error           Good Decision
Do not reject null    Good Decision          Type II Error

Reject null: assume alternate is proven.
Do not reject null: evidence for alternate not strong enough.
Type I Error: believe in improvement when none occurred.
Type II Error: cannot show improvement when in fact it occurred.
The Testing Process
1. Set up Hypotheses (Null and Alternate)
2. Pick a significance level (alpha)
3. Compute critical value (based on alpha)
4. Compute test statistic (from sample)
5. Conclude: If |test statistic| > |critical value|,
   then Alternate Hypothesis proven at alpha
   level of significance.
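The five steps can be sketched for a two-sided z-test as follows. This is a minimal illustration rather than part of the course material: the function name `z_test_decision`, the choice of a z-test, and the test statistic of 2.5 in the usage line are all assumptions made for the example.

```python
from statistics import NormalDist

def z_test_decision(test_statistic, alpha=0.05):
    """Two-sided rule: reject the null if |z| exceeds the critical value."""
    # Step 3: critical value from the standard normal distribution
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: compare the test statistic with the critical value
    reject_null = abs(test_statistic) > critical_value
    return critical_value, reject_null

# Hypothetical test statistic of 2.5 at alpha = 0.05
critical, reject = z_test_decision(test_statistic=2.5, alpha=0.05)
print(f"critical value = {critical:.3f}, reject null: {reject}")
```

For a one-sided alternate, the critical value would instead come from `inv_cdf(1 - alpha)` and the comparison would drop the absolute value.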
Hypothesis Testing Roadmap
[Figure: Hypothesis Testing Roadmap, a tree from data type to test]
Continuous, Normal:
   Means: Z-tests, t-tests, ANOVA
   Variance: χ², F-test, Bartlett’s
   Correlation: Correlation, Regression
Continuous, Non-Normal:
   Medians: Sign Test, Wilcoxon, Kruskal-Wallis, Mood’s, Friedman’s
   Variance: Levene’s
   Correlation
Attribute:
   χ² Contingency Tables
Also shown: a box reading "Same tests as Non-Normal Medians"
Parametric Tests
Use parametric tests when:
1. The data are normally distributed
2. The variances of populations (if more than one is
   sampled from) are equal
3. The data are at least interval scaled
One sample z - test
A gap between two parts should be 15 microns. A sample of
25 measurements shows a mean of 17 microns. Test
whether this is significantly different from 15, assuming the
population standard deviation is known to be 4.
One-Sample Z
Test of mu = 15 vs not = 15
The assumed standard deviation = 4

 N     Mean  SE Mean              95% CI     Z      P
25  17.0000   0.8000  (15.4320, 18.5680)  2.50  0.012
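The figures in this output can be reproduced by hand. The sketch below uses only the numbers stated in the problem (target 15, sample mean 17, sigma 4, n = 25); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

target, sample_mean, sigma, n = 15, 17, 4, 25    # values from the problem statement

se_mean = sigma / sqrt(n)                        # standard error of the mean
z = (sample_mean - target) / se_mean             # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value

z_crit = NormalDist().inv_cdf(0.975)             # multiplier for a 95% interval
ci = (sample_mean - z_crit * se_mean, sample_mean + z_crit * se_mean)

print(f"SE = {se_mean:.4f}, Z = {z:.2f}, P = {p_value:.3f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Since the p-value of about 0.012 is below 0.05 and the interval excludes 15, the gap is significantly different from the 15 micron target at the 5% level.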
Z-test for proportions
You wish to test the hypothesis that more than 66% (roughly
two-thirds) of people in the population prefer your brand over
Brand X. Of the 200 customers surveyed, 140 say they
prefer your brand. Is this statistically significant?
Test and CI for One Proportion
Test of p = 0.66 vs p > 0.66

                             95% Lower
Sample    X    N  Sample p       Bound  Z-Value  P-Value
1       140  200  0.700000    0.646701     1.19    0.116
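The Z-value and p-value can be checked with the normal approximation to the binomial. This is a sketch using the survey numbers from the problem; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, x, n = 0.66, 140, 200            # hypothesized proportion, successes, sample size

p_hat = x / n                        # sample proportion
se_null = sqrt(p0 * (1 - p0) / n)    # standard error under the null
z = (p_hat - p0) / se_null
p_value = 1 - NormalDist().cdf(z)    # one-sided: alternate is p > p0

print(f"Sample p = {p_hat:.2f}, Z = {z:.2f}, P = {p_value:.3f}")
```

At alpha = 0.05 a p-value of about 0.116 is not small enough to reject the null, so the survey does not show that preference exceeds 66%.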
One sample t-test
The data show reductions in percentage of errors
made by claims processors after undergoing a
training course. The target was a reduction of at
least 13%. Was it achieved?

Error Reduction %:
10, 12, 9, 8, 7, 12, 14, 13, 15, 16, 18, 12, 18, 19, 20, 17, 15
One Sample t-test – Minitab results
One-Sample T: Error Reduction
Test of mu = 13 vs > 13

                                                95% Lower
Variable          N     Mean   StDev  SE Mean       Bound     T      P
Error Reduction  17  13.8235  3.9248   0.9519     12.1616  0.87  0.200

The p-value of 0.20 is greater than alpha (0.05), so the null cannot
be rejected: the reduction in errors could not be proven to be greater
than 13%. A sample mean at least this far above 13 would occur about
20% of the time even if the true mean reduction were exactly 13%.
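The test statistic can be recomputed from the 17 observations. The sketch below stays within the Python standard library, so instead of a p-value it compares T with a tabulated critical value; the constant `T_CRITICAL` (one-sided 5% point of the t-distribution with 16 degrees of freedom) is taken from a standard t-table and is not part of the slide.

```python
from math import sqrt
from statistics import mean, stdev

reductions = [10, 12, 9, 8, 7, 12, 14, 13, 15, 16, 18, 12, 18, 19, 20, 17, 15]
target = 13

n = len(reductions)
se_mean = stdev(reductions) / sqrt(n)             # standard error of the mean
t_stat = (mean(reductions) - target) / se_mean    # test statistic, n - 1 = 16 DF

T_CRITICAL = 1.746   # assumption: one-sided 5% critical value, 16 DF, from a t-table

print(f"Mean = {mean(reductions):.4f}, SE Mean = {se_mean:.4f}, T = {t_stat:.2f}")
print("reject null" if t_stat > T_CRITICAL else "cannot reject null")
```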
Two Sample t-test
You realize that though the overall reduction is not proven to
be more than 13%, there seems to be a difference between
how men and women react to the training. You separate the
17 observations by gender, and wish to test whether there is
in fact a significant difference between genders in error
reduction.
M: 10, 12, 9, 8, 7, 12, 14, 13
F: 15, 16, 18, 12, 18, 19, 20, 17, 15
Two Sample t-test
The test for equal variances shows no significant difference between
the variances of the 2 samples. Thus a pooled 2-sample t-test may be
conducted. The results are shown below. The p-value (reported as 0.000,
that is, below 0.0005) indicates there is a significant difference
between the genders in their error reduction due to the training.

Two-sample T for Error Reduction M vs Error Reduction F

             N   Mean  StDev  SE Mean
Error Red M  8  10.63   2.50     0.89
Error Red F  9  16.67   2.45     0.82

Difference = mu (Error Red M) - mu (Error Red F)
Estimate for difference: -6.04167
95% CI for difference: (-8.60489, -3.47844)
T-Test of difference = 0 (vs not =): T-Value = -5.02  P-Value = 0.000  DF = 15
Both use Pooled StDev = 2.4749
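The pooled standard deviation and T-value follow directly from the two samples. A minimal sketch, with illustrative variable names:

```python
from math import sqrt
from statistics import mean, variance

men = [10, 12, 9, 8, 7, 12, 14, 13]
women = [15, 16, 18, 12, 18, 19, 20, 17, 15]

n1, n2 = len(men), len(women)
df = n1 + n2 - 2
# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * variance(men) + (n2 - 1) * variance(women)) / df
pooled_sd = sqrt(pooled_var)

diff = mean(men) - mean(women)
t_stat = diff / (pooled_sd * sqrt(1 / n1 + 1 / n2))

print(f"Difference = {diff:.5f}, Pooled StDev = {pooled_sd:.4f}, T = {t_stat:.2f}, DF = {df}")
```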
Chi-squared test of independence
For tabulated count data.
Two types of glass sheets are manufactured, and the
defects found on 111 sheets are tabulated based on
the type of glass (Type A and Type B) and the location
of the defect on each sheet (Zone 1 and Zone 2).
You wish to test whether the two variables (Type of
glass and Location of error on the glass) are
statistically independent of each other.
Chi Square Test Results
Tabulated statistics: Glass Type, Location
Rows: Glass Type   Columns: Location

         Zone 1  Zone 2     All
Type A       29      23      52
          29.98   22.02   52.00
Type B       35      24      59
          34.02   24.98   59.00
All          64      47     111
          64.00   47.00  111.00

Cell Contents: Count
               Expected count

Pearson Chi-Square = 0.143, DF = 1, P-Value = 0.705
Likelihood Ratio Chi-Square = 0.143, DF = 1, P-Value = 0.705
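The expected counts and the Pearson statistic can be rebuilt from the four observed counts. The sketch below relies on the fact that, for 1 degree of freedom, the chi-square tail probability equals `erfc(sqrt(x / 2))`, which avoids any external library.

```python
from math import erfc, sqrt

observed = [[29, 23],   # Type A: Zone 1, Zone 2
            [35, 24]]   # Type B: Zone 1, Zone 2

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

p_value = erfc(sqrt(chi_square / 2))   # valid for 1 degree of freedom only

print(f"Pearson Chi-Square = {chi_square:.3f}, P-Value = {p_value:.3f}")
```

With a p-value of about 0.705, there is no evidence that the location of a defect depends on the type of glass.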
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 138 – 142: Choose any 3 of the datasets mentioned on
   those pages and answer the related questions. [1-sample means]
2. Pages 148 – 151: Choose any 3 of the datasets mentioned on
   those pages and answer the related questions. [1-sample
   proportions]
3. Pages 165 – 168: Choose any 3 of the datasets mentioned on
   those pages and answer the related questions. [2-sample tests]
Basics of ANOVA
Analysis of Variance, or ANOVA is a technique used to
test the hypothesis that there is a difference between
the means of two or more populations. It is used in
Regression, as well as to analyze a factorial
experiment design, and in Gauge R&R studies.
The basic premise of ANOVA is that differences in the
means of 2 or more groups can be seen by
partitioning the Sum of Squares. Sum of Squares
(SS) is simply the sum of the squared deviations of the
observations from their means. Consider the following
example with two groups. The measurements show the
thumb lengths in centimeters of two types of
primates.
Total variation (SS) is 28, of which only 4 (2+2) is
within the two groups. Thus 24 of the 28 is due to the
differences between the groups. This partitioning of
SS into ‘between’ and ‘within’ is used to test the
hypothesis that the groups are in fact different from
each other.
See www.statsoft.com for more details.
Obs.   Type A   Type B
1         2        6
2         3        7
3         4        8
Mean      3        7
SS        2        2

Overall: Mean = 5, SS = 28
Results of ANOVA
The results of running an ANOVA on the sample data from the previous
slide are shown here. The hypothesis test computes the F-value as the
ratio of MS ‘Between’ to MS ‘Within’. The greater the value of F, the
stronger the evidence that there is in fact a difference between the
groups. Looking it up in an F-distribution table shows a p-value of
0.008: an F-value this large would arise by chance only 0.8% of the
time if the population means were equal, so the difference is taken to
be real (to exist in the population, not just in the sample).

One-way ANOVA: Type A, Type B

Source  DF     SS     MS      F      P
Factor   1  24.00  24.00  24.00  0.008
Error    4   4.00   1.00
Total    5  28.00

S = 1   R-Sq = 85.71%   R-Sq(adj) = 82.14%
Minitab: Stat/ANOVA/One-Way (unstacked)
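The partitioning of the Sum of Squares into ‘between’ and ‘within’ can be verified on the thumb-length data. A minimal sketch with illustrative variable names; it stops at the F-value, since the p-value requires an F-distribution table or a statistics package.

```python
from statistics import mean

groups = {"Type A": [2, 3, 4], "Type B": [6, 7, 8]}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_within = sum((x - mean(values)) ** 2
                for values in groups.values() for x in values)
ss_between = ss_total - ss_within            # what the grouping explains

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_value = (ss_between / df_between) / (ss_within / df_within)
r_squared = ss_between / ss_total

print(f"SS total = {ss_total}, between = {ss_between}, within = {ss_within}")
print(f"F = {f_value:.2f}, R-Sq = {r_squared:.2%}")
```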
Two-Way ANOVA
Is the strength of steel produced different
for different temperatures to which it is
heated and the speed with which it is
cooled? Here 2 factors (speed and temp)
are varied at 2 levels each, and strengths
of 3 parts produced at each combination
are measured as the response variable.

Strength  Temp  Speed
20.0      Low   Slow
22.0      Low   Slow
21.5      Low   Slow
23.0      Low   Fast
24.0      Low   Fast
22.0      Low   Fast
25.0      High  Slow
24.0      High  Slow
24.5      High  Slow
17.0      High  Fast
18.0      High  Fast
17.5      High  Fast

Two-way ANOVA: Strength versus Temp, Speed

Source       DF       SS       MS      F      P
Temp          1   3.5208   3.5208   5.45  0.048
Speed         1  20.0208  20.0208  31.00  0.001
Interaction   1  58.5208  58.5208  90.61  0.000
Error         8   5.1667   0.6458
Total        11  87.2292

S = 0.8036   R-Sq = 94.08%   R-Sq(adj) = 91.86%

The results show significant main effects
as well as an interaction effect.
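The sums of squares in the two-way table can be recomputed from the 12 strength measurements. The sketch below is one straightforward way to do the partition for a balanced 2x2 design with replication; the helper `ss_main` and the dictionary layout are illustrative choices.

```python
from statistics import mean

# (Temp, Speed) -> strength measurements, transcribed from the data above
cells = {
    ("Low", "Slow"): [20.0, 22.0, 21.5],
    ("Low", "Fast"): [23.0, 24.0, 22.0],
    ("High", "Slow"): [25.0, 24.0, 24.5],
    ("High", "Fast"): [17.0, 18.0, 17.5],
}
all_values = [x for values in cells.values() for x in values]
grand_mean = mean(all_values)

def ss_main(factor_index):
    """Sum of squares for one factor: level means against the grand mean."""
    total = 0.0
    for level in {key[factor_index] for key in cells}:
        values = [x for key, vals in cells.items()
                  if key[factor_index] == level for x in vals]
        total += len(values) * (mean(values) - grand_mean) ** 2
    return total

ss_temp = ss_main(0)
ss_speed = ss_main(1)
ss_cells = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in cells.values())
ss_interaction = ss_cells - ss_temp - ss_speed
ss_error = sum((x - mean(v)) ** 2 for v in cells.values() for x in v)
ms_error = ss_error / (len(all_values) - len(cells))   # 12 - 4 = 8 DF

# Each effect has 1 DF, so its mean square equals its sum of squares
for name, ss in [("Temp", ss_temp), ("Speed", ss_speed), ("Interaction", ss_interaction)]:
    print(f"{name:12s} SS = {ss:8.4f}  F = {ss / ms_error:6.2f}")
```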
Two-Way ANOVA
The box plots give an indication of the interaction effect. The
effect of speed on the response is different for different levels
of temperature. Thus, there is an interaction effect between
temperature and speed.
[Figure: Boxplot of Strength by Temp, Speed. Y-axis: Strength (16 to 25); groups: Speed (Fast, Slow) within Temp (High, Low).]
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 192 – 194: Choose any 3 of the datasets mentioned on those
   pages and answer the related questions. [1-way ANOVA]
2. Pages 204 – 206: Choose any 3 of the datasets mentioned on those
   pages and answer the related questions. [2-way ANOVA]
DOE Overview
A design of experiment involves controlling specific inputs (factors) at
various levels (typically 2 levels, like “High” and “Low” settings) to
observe the change in output as a result, and analyzing the data to
determine the significance and relative importance of factors.
The simplest case would be to vary a single factor, say temperature,
while baking cookies. Keeping all else constant, we can set temperature
at 350 degrees and 400 degrees, and make several batches at those two
temperatures, and measure the output desired – in this case it could be a
rating by experts of crispiness of the cookies on a scale of 0-10.
Full Factorial Designs
A 2^F factorial design implies that there are
F factors at 2 levels each. The case
described on the previous slide with only
one factor is the simplest. Having two
factors at 2 levels would give us four
combinations. Three factors yield 8
combinations, 4 would yield 16, and so
forth.
The following table shows the full factorial
(all 8 combinations) design for 3 factors
(temperature, baking time, and amount of butter),
each at two levels, HIGH and LOW.

Temp   Time   Butter
Low    Low    Low
High   Low    Low
Low    High   Low
High   High   Low
Low    Low    High
High   Low    High
Low    High   High
High   High   High
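Enumerating a full factorial design is a Cartesian product of the factor levels. The sketch below generates the 8 runs for the cookie example in the same order as the table; the variable names are illustrative.

```python
from itertools import product

factors = ["Temp", "Time", "Butter"]
levels = ["Low", "High"]

# product() varies the last position fastest, so each tuple is reversed to
# make the first factor (Temp) alternate fastest, matching the table
runs = [combo[::-1] for combo in product(levels, repeat=len(factors))]

for run in runs:
    print(dict(zip(factors, run)))
```

Adding a fourth factor name to `factors` would yield the 16 runs mentioned in the text.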
Fractional Factorials
The previous example would require 8
different setups to bake the cookies. For
each setup, one could bake several
batches, say 4 batches, to get a measure
of the internal variation. In practice, as the
number of factors tested grows, it is
difficult to even create all the setups
needed, much less have replications
within a setup.
An alternative is to use fractional factorial
designs, typically a ½ or ¼. As the name
suggests, a ½ factorial design with 3
factors would only require 4 of the 8
combinations to be tested. This entails
some loss of resolution, usually a
confounding of interaction terms, which
may be of no interest to the experimenter,
and can be sacrificed.
Temp   Time   Butter
High   High   High
Low    Low    High
Low    High   Low
High   Low    Low
Minitab: Stat/DOE/Create Factorial Design/Display Factorial Designs
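One common way to pick a half fraction is to code the levels as -1 (Low) and +1 (High) and keep only the runs whose three coded levels multiply to +1. The four runs listed in the table satisfy this rule, although the slide does not say how they were chosen, so treat the selection rule below as an inference. The price of the fraction is that each main effect can no longer be separated from the interaction of the other two factors.

```python
from itertools import product

factors = ["Temp", "Time", "Butter"]
CODED = {-1: "Low", +1: "High"}

full_design = list(product([-1, +1], repeat=3))

# Half fraction: keep runs where the product of the three coded levels is +1
half_fraction = [run for run in full_design if run[0] * run[1] * run[2] == +1]

for run in half_fraction:
    print({name: CODED[level] for name, level in zip(factors, run)})
```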
Running the Experiment – Outcome Values
Once the settings to be used are
determined, we can run the experiment
and measure the values of the outcome
variable. This table shows the values of
the outcome variable “Crisp”, showing the
crispiness index for the cookies, for each
of the 8 settings of the full factorial
experiment.
Temp   Time   Butter   Crisp
Low    Low    Low        7
High   Low    Low       10
Low    High   Low        7
High   High   Low        5
Low    Low    High       4
High   Low    High       9
Low    High   High       8
High   High   High       8
Analysis of Data
Analyzing the data in Minitab for the main effects and ignoring interaction
terms, we get the following output:

Factorial Fit: Crispiness versus Temp, Time, Butter

Estimated Effects and Coefficients for Crispiness (coded units)

Term      Effect    Coef  SE Coef      T      P
Constant          7.2500   0.3750  19.33  0.000
Temp      3.0000  1.5000   0.3750   4.00  0.016
Time      0.5000  0.2500   0.3750   0.67  0.541
Butter    1.5000  0.7500   0.3750   2.00  0.116

S = 1.06066   R-Sq = 83.64%   R-Sq(adj) = 71.36%

Analysis of Variance for Crispiness (coded units)

Source          DF  Seq SS  Adj SS  Adj MS     F      P
Main Effects     3  23.000  23.000   7.667  6.81  0.047
Residual Error   4   4.500   4.500   1.125
Total            7  27.500

Estimated Coefficients for Crispiness using data in uncoded units

Term          Coef
Constant   7.25000
Temp       1.50000
Time      0.250000
Butter    0.750000

Note that only the temperature is significant (p-value lower than 0.05). The effect of
temperature is 3.00, which means that if temp. is set at HIGH, crispiness will increase
by 3.00 units on average, compared to the LOW setting.
Minitab: Stat/DOE/Create Factorial Design/Analyze Factorial Design
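The columns of this output are tied together by simple arithmetic: a coded coefficient is half the effect, S is the square root of the residual mean square, and each T is a coefficient divided by its standard error. The sketch below checks those relationships. It takes the reported effects and sums of squares as given rather than re-deriving them from the run table, and the variable names are illustrative.

```python
from math import sqrt

n_runs = 8
effects = {"Temp": 3.0, "Time": 0.5, "Butter": 1.5}        # reported effects
ss_main_effects, ss_residual, df_residual = 23.0, 4.5, 4   # reported ANOVA values

s = sqrt(ss_residual / df_residual)          # residual standard deviation
se_coef = s / sqrt(n_runs)                   # standard error of each coefficient
r_squared = ss_main_effects / (ss_main_effects + ss_residual)

for term, effect in effects.items():
    coef = effect / 2                        # coded-unit coefficient is half the effect
    print(f"{term:7s} Coef = {coef:.4f}  T = {coef / se_coef:.2f}")
print(f"S = {s:.5f}, SE Coef = {se_coef:.4f}, R-Sq = {r_squared:.2%}")
```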
Assignment
From the book “Doing Data Analysis with Minitab 14” by Robert Carver:
1. Pages 309 – 310: Answer any 4 of the 7 questions on those pages.
   [DOE]
Non-Parametric Tests
Use non-parametric tests:
1. When data are obviously non-normal
2. When the sample is too small for the central limit
   theorem to lead to normality of averages
3. When the distribution is not known
4. When the data are nominal or ordinal scaled
Remember that even non-parametric tests have some
assumptions about the data.
The sign test
The story:
A patient sign-in process at a hospital is being evaluated,
and the time lapse between arrival and seeing a physician
is recorded for a random sample of patients. You believe
that currently the median time is over 20 minutes, and wish
to test the hypothesis.
The sign test – data
Data for the test are as follows (Process Time):
5, 7, 15, 30, 32, 35, 62, 75, 80, 85, 95, 100
The histogram of the data shows that it is non-normal, and
the sample size is too small for the central limit theorem
to apply.
The data are at least ordinal in nature (here they are ratio
scaled), satisfying the assumption of the sign test.
[Figure: Histogram of Process Time. X-axis: Process Time (0 to 100); Y-axis: Frequency (0 to 3).]
Sign test - analysis
Since the hypothesis is that the median is greater than 20, the
test compares each value to 20. Those that are smaller get
a negative sign, those that are larger than 20 get a positive
one. The sign test then computes the probability that the
number of negatives and positives observed would come
about through random chance if the null were true (that the
median time is 20 minutes).
Sign Test in Minitab - Results
Sign Test for Median: Process Time
Sign test of median = 20.00 versus > 20.00

               N  Below  Equal  Above       P  Median
Process Time  12      3      0      9  0.0730   48.50

In this data, there are 9 observations above 20, and 3 of them
below. If the population median were exactly 20, a split at least
this lopsided would occur with probability 0.0730. Thus, there is
insufficient evidence to prove the hypothesis (to reject the null)
at the 5% level, but enough if you are willing to take a 7.3% risk.
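The 0.0730 probability is a binomial tail: under the null, each observation falls above the median with probability 0.5, so the number above 20 follows a Binomial(12, 0.5) distribution. A minimal sketch with illustrative variable names:

```python
from math import comb

times = [5, 7, 15, 30, 32, 35, 62, 75, 80, 85, 95, 100]
hypothesized_median = 20

above = sum(t > hypothesized_median for t in times)
below = sum(t < hypothesized_median for t in times)
n = above + below                     # values equal to the median are dropped

# Probability of seeing `above` or more successes out of n fair coin flips
p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n

print(f"Below = {below}, Above = {above}, P = {p_value:.4f}")
```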
The sign test – other applications
The sign test can also be used for testing the value of the
median difference between paired samples, as
illustrated in the following link. The difference between
values in a paired sample can be treated as a single
sample, so any 1-sample hypothesis test can be applied.
In such a case, the assumption is that the pairs are
independent of each other.
The equivalent parametric tests for the sign test are the 1-sample z-test and the 1-sample t-test.
http://davidmlane.com/hyperstat/B135165.html
Wilcoxon Signed-Rank Test
A test is conducted where customers are asked to rate
two products based on various criteria, and come
up with a score between 0 and 100 for each. The
tester’s goal is to check whether product A, the new
version, is perceived to be superior to product B.
The null hypothesis would be that they are equal to
each other.
Wilcoxon Signed-Rank Test
The data

Prod A  Prod B  Diff
55      50        5
60      62       -2
77      70        7
82      78        4
99      90        9
92      95       -3
86      90       -4
84      80        4
90      86        4
72      71        1

The measures are ratings by people,
so the data are not necessarily interval
scaled, and certainly not ratio scaled.
Thus a paired sample t-test is not
appropriate. A non-parametric
equivalent of that is the Wilcoxon
Signed-Rank Test.
This is similar to the sign test, but more
sensitive.
Wilcoxon Signed-Rank Test
Unlike the sign test, which only looks at whether something is larger or
smaller, this test uses the magnitudes of the differences, rank
orders them, and then applies the sign of the difference and
computes the sum of those ranks. This statistic (called W) has a
sampling distribution that is approximately normal.
For details on the technique, see the link below.
Assumptions are:
1. Data are at least ordinal in nature
2. The pairs are independent of each other
3. Dependent variable is continuous in nature
http://faculty.vassar.edu/lowry/ch12a.html
Wilcoxon test in Minitab - Results
Wilcoxon Signed Rank Test: Diff
Test of median = 0.000000 versus median > 0.000000

          N for   Wilcoxon         Estimated
       N   Test  Statistic      P     Median
Diff  10     10       44.5  0.046      2.500
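The Wilcoxon statistic is the sum of the ranks of the positive differences, with tied magnitudes sharing an average rank. The sketch below recomputes it and then applies a normal approximation with a continuity correction and no tie adjustment. That particular approximation is an assumption on my part, chosen because it agrees with the reported p-value to three decimals; the helper `average_ranks` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

prod_a = [55, 60, 77, 82, 99, 92, 86, 84, 90, 72]
prod_b = [50, 62, 70, 78, 90, 95, 90, 80, 86, 71]
diffs = [a - b for a, b in zip(prod_a, prod_b) if a != b]   # zero differences are dropped

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

ranks = average_ranks([abs(d) for d in diffs])
w = sum(rank for rank, d in zip(ranks, diffs) if d > 0)     # sum of positive ranks

n = len(diffs)
mean_w = n * (n + 1) / 4                       # expected W under the null
sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w - 0.5 - mean_w) / sd_w                  # continuity correction
p_value = 1 - NormalDist().cdf(z)              # one-sided: median difference > 0

print(f"W = {w}, P = {p_value:.3f}")
```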
Mann-Whitney Test – Two Samples
The Story:
Customers were asked to rate a service in the past, and 10
people did so. After some improvements were made,
data were collected again, with a new random set of
customers. Twelve people responded this time.
There is no pairing or matching of data, since the samples of
customers for the old and the new processes are
different.
http://faculty.vassar.edu/lowry/ch11a.html
Mann-Whitney Test – Two Samples
Old: 60, 70, 85, 78, 90, 68, 35, 80, 80, 75
New: 85, 85, 90, 94, 90, 70, 75, 90, 90, 100, 95, 90
Note that the assumptions of a 2-sample t-test are
violated because the data are not interval scaled,
and may not be normally distributed.
The Mann-Whitney Test is the non-parametric
alternative to the 2-sample t-test.
Mann-Whitney Test – Two Samples
The Mann-Whitney test rank orders all the data, with both
columns combined into one. The ranks are then separated
by group so the raw data is now converted to ranks. The
sum of the ranks for each column is computed.
The sums of ranks are expected to be in proportion to the
sample sizes, if there is no difference between the groups.
Based on this premise, the actual sum is compared to the
expected sum, and the statistic is tested for significance.
See details with another example on this link from Vassar
Univ. :
http://faculty.vassar.edu/lowry/ch11a.html
Mann-Whitney Test in Minitab - Results
Mann-Whitney Test and CI: Old, New
N Median
Old 10 76.50
New 12 90.00
Point estimate for ETA1-ETA2 is -14.00
95.6 Percent CI for ETA1-ETA2 is (-22.00,-5.00)
W = 72.5
Test of ETA1 = ETA2 vs ETA1 < ETA2 is significant at 0.0028
The test is significant at 0.0025 (adjusted for ties)
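W is the sum of the ranks that the Old sample receives when both samples are ranked together. The sketch below recomputes W and the unadjusted significance level using a normal approximation with a continuity correction. The approximation is an assumption that happens to match the reported 0.0028; the tie-adjusted figure of 0.0025 is not reproduced here, and the helper `average_rank` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

old = [60, 70, 85, 78, 90, 68, 35, 80, 80, 75]
new = [85, 85, 90, 94, 90, 70, 75, 90, 90, 100, 95, 90]

combined = sorted(old + new)

def average_rank(value):
    """Rank within the combined sample; ties share the mean of their ranks."""
    return combined.index(value) + (combined.count(value) + 1) / 2

w = sum(average_rank(x) for x in old)          # rank sum of the first sample

n1, n2 = len(old), len(new)
mean_w = n1 * (n1 + n2 + 1) / 2                # expected rank sum under the null
sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)      # not adjusted for ties
z = (w + 0.5 - mean_w) / sd_w                  # continuity correction
p_value = NormalDist().cdf(z)                  # one-sided: Old < New

print(f"W = {w}, significant at {p_value:.4f}")
```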
Kruskal-Wallis Test – 3 or more samples
Here the data would be similar to the Mann-Whitney
test, except for having more than 2 samples. For
parametric data, one would conduct an ANOVA to
test for differences between 3 or more
populations. The Kruskal-Wallis test is thus a nonparametric equivalent of ANOVA.
Kruskal-Wallis Test – Data
The data show ratings of some
product by three different groups.

Adults:   7, 5, 6, 4, 2, 6, 5
Teens:    9, 9, 8, 5, 9, 10, 7, 8
Children: 3, 4, 3, 5, 10, 2

The same data are shown stacked
below to perform the test in
Minitab.

Rating  Factor
7       Adults
5       Adults
6       Adults
4       Adults
2       Adults
6       Adults
5       Adults
9       Teens
9       Teens
8       Teens
5       Teens
9       Teens
10      Teens
7       Teens
8       Teens
3       Children
4       Children
3       Children
5       Children
10      Children
2       Children
Kruskal-Wallis Test
The Kruskal-Wallis test proceeds very similarly to the Mann-Whitney test. The data are all ranked from low to high
values, and the ranks then separated by group. For each
group, the ranks are summed and averaged.
Each group average is compared to the overall average, and
the deviation measured, weighted by the number of
observations in each group. If the groups were identical,
the deviations from the grand mean would be a small
number (not 0, as one might intuitively think) that can be
computed.
The actual difference is compared to the expected one (H
statistic computed) to complete the test. See the link below
for details of the computation, if interested.
http://faculty.vassar.edu/lowry/ch14a.html
Kruskal-Wallis Test Minitab Results
Kruskal-Wallis Test: Rating versus Factor
Kruskal-Wallis Test on Rating

Factor     N  Median  Ave Rank      Z
Adults     7   5.000       8.6  -1.23
Children   6   3.500       7.2  -1.79
Teens      8   8.500      15.9   2.86
Overall   21              11.0

H = 8.37  DF = 2  P = 0.015
H = 8.48  DF = 2  P = 0.014 (adjusted for ties)
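Both H values can be recomputed from the ratings. The sketch below uses the textbook H formula on the rank sums and the standard tie correction; with 2 degrees of freedom the chi-square tail probability reduces to `exp(-x / 2)`, so no statistics library is needed. The variable names and the helper `average_rank` are illustrative.

```python
from math import exp

ratings = {
    "Adults": [7, 5, 6, 4, 2, 6, 5],
    "Teens": [9, 9, 8, 5, 9, 10, 7, 8],
    "Children": [3, 4, 3, 5, 10, 2],
}
combined = sorted(x for values in ratings.values() for x in values)
n_total = len(combined)

def average_rank(value):
    """Rank within the combined sample; ties share the mean of their ranks."""
    return combined.index(value) + (combined.count(value) + 1) / 2

rank_sums = {g: sum(average_rank(x) for x in v) for g, v in ratings.items()}
h = (12 / (n_total * (n_total + 1))
     * sum(rank_sums[g] ** 2 / len(v) for g, v in ratings.items())
     - 3 * (n_total + 1))

# Tie correction: divide H by 1 - sum(t^3 - t) / (N^3 - N) over groups of t tied values
tie_term = sum(combined.count(v) ** 3 - combined.count(v) for v in set(combined))
h_adjusted = h / (1 - tie_term / (n_total ** 3 - n_total))

# Chi-square tail probability for 2 degrees of freedom
p_value, p_adjusted = exp(-h / 2), exp(-h_adjusted / 2)

print(f"H = {h:.2f}  P = {p_value:.3f}")
print(f"H = {h_adjusted:.2f}  P = {p_adjusted:.3f} (adjusted for ties)")
```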
Mood’s Median Test
Mood median test for Rating
Chi-Square = 10.52   DF = 2   P = 0.005

Factor    N<=  N>  Median  Q3-Q1
Adults      6   1    5.00   2.00
Children    5   1    3.50   3.50
Teens       1   7    8.50   1.75

Overall median = 6.00
[Figure: Individual 95.0% CIs for each group median, plotted on a scale marked 4.0, 6.0, 8.0.]

Mood’s median test is an alternative to Kruskal-Wallis. It is
generally more robust against violations of assumptions, but less
powerful.
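Mood's test is a chi-square test on the counts of observations at or below versus above the overall median. The sketch below rebuilds the chi-square value from the N<= and N> counts in the output; with 2 degrees of freedom the tail probability is `exp(-x / 2)`. The variable names are illustrative.

```python
from math import exp

# (count <= overall median, count > overall median) per group, from the output
counts = {"Adults": (6, 1), "Children": (5, 1), "Teens": (1, 7)}

col_totals = [sum(c[i] for c in counts.values()) for i in range(2)]
grand_total = sum(col_totals)

chi_square = 0.0
for low, high in counts.values():
    row_total = low + high
    for observed, col_total in zip((low, high), col_totals):
        expected = row_total * col_total / grand_total
        chi_square += (observed - expected) ** 2 / expected

p_value = exp(-chi_square / 2)   # chi-square tail probability for 2 degrees of freedom

print(f"Chi-Square = {chi_square:.2f}  P = {p_value:.3f}")
```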
Friedman’s Test
Friedman’s Test is the non-parametric equivalent to a
randomized block design in an ANOVA. In other
words, there are 3 or more groups, but each row
of values across the groups is matched.
The story
A person’s performance is rated in a normal state,
rated again after introducing noise in the
environment, and finally with the introduction of
classical music in the background. This is done for
a sample of 7 employees.
Friedman’s Test – Data
The data show the ratings of
performance by person in each of 3
conditions.

Normal  Noise  Music
7       5      8
8       4      8
6       6      8
9       5      8
5       5      7
7       4      9
8       4      9

The same data are
stacked in the table below, for
doing the test in Minitab. Each
person represents a block of data,
since the 3 numbers for that person
are related.

Perform  Group   Block
7        Normal  1
8        Normal  2
6        Normal  3
9        Normal  4
5        Normal  5
7        Normal  6
8        Normal  7
5        Noise   1
4        Noise   2
6        Noise   3
5        Noise   4
5        Noise   5
4        Noise   6
4        Noise   7
8        Music   1
8        Music   2
8        Music   3
8        Music   4
7        Music   5
9        Music   6
9        Music   7
Friedman’s Test - Analysis
Friedman’s test also ranks the ratings, but this time the
ranking is done internally within each row – the three
scores for each person are ranked 1, 2, and 3.
These ranks are then summed and averaged.
If the groups are identical, then one would expect no
difference in the sum or mean of rankings for each
group. In other words, if the conditions did not affect
the performance rating, the rankings would either be
the same, or vary randomly across people to yield
equal sums.
The sums are compared to this expectation to test the
hypothesis. See the following link for more details.
http://faculty.vassar.edu/lowry/ch15a.html
Friedman’s Test in Minitab – Results.
Friedman Test: Perform versus Group blocked by Block

S = 9.50   DF = 2  P = 0.009
S = 10.64  DF = 2  P = 0.005 (adjusted for ties)

                       Sum of
Group   N  Est Median   Ranks
Music   7       8.000    19.5
Noise   7       4.667     8.0
Normal  7       7.333    14.5

Grand median = 6.667
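The rank sums and both S values can be recomputed by ranking the three scores within each employee. The sketch below uses the textbook Friedman formula and the standard tie correction; with 2 degrees of freedom the chi-square tail probability is `exp(-x / 2)`. The variable names are illustrative.

```python
from math import exp

scores = {                      # one value per employee (block), in order
    "Normal": [7, 8, 6, 9, 5, 7, 8],
    "Noise":  [5, 4, 6, 5, 5, 4, 4],
    "Music":  [8, 8, 8, 8, 7, 9, 9],
}
groups = list(scores)
n_blocks, k = len(scores["Normal"]), len(groups)

rank_sums = dict.fromkeys(groups, 0.0)
tie_term = 0
for block in zip(*scores.values()):            # the k scores of one employee
    ordered = sorted(block)
    for group, value in zip(groups, block):
        # average rank within the block
        rank_sums[group] += ordered.index(value) + (ordered.count(value) + 1) / 2
    tie_term += sum(ordered.count(v) ** 3 - ordered.count(v) for v in set(ordered))

s = (12 / (n_blocks * k * (k + 1)) * sum(r ** 2 for r in rank_sums.values())
     - 3 * n_blocks * (k + 1))
s_adjusted = s / (1 - tie_term / (n_blocks * k * (k ** 2 - 1)))

# Chi-square tail probability for 2 degrees of freedom
p_value, p_adjusted = exp(-s / 2), exp(-s_adjusted / 2)

print(rank_sums)
print(f"S = {s:.2f}  P = {p_value:.3f}")
print(f"S = {s_adjusted:.2f}  P = {p_adjusted:.3f} (adjusted for ties)")
```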
Assignment
From the book “Doing Data Analysis with Minitab 14” by
Robert Carver:
1. Pages 293 – 294: Choose any 3 datasets on those
   pages and answer the related questions. [Nonparametric tests]