Descriptive Statistics

DATA CONFUSION
How to confuse yourself and others
with Data Analysis
AGENDA FOR TODAY’S TALK
• Good Graphs – Bad Graphs
• The Law of Averages
• PTBD Analysis
• Enumerative & Analytical Problems
• PARC Analysis
• Wrong Methods of Analysis
“There are three kinds of lies:
Lies, damned lies and statistics”
Attributed to Benjamin Disraeli by Mark Twain
GOOD GRAPHS AND BAD
GRAPHS
DATA RELEVANCE
• Graphs are only as good as the data they
display
• No amount of creativity can produce good
graphs from dubious data
DATA CONTENT
• Don’t produce graphs from very small
amounts of data
• The human brain can grasp 1, 2 or 3
numbers without a graph
RULES FOR PRODUCING GOOD
GRAPHS
• KEEP IT SIMPLE AND STUPID
– Jesse Ventura
• Tell the truth – don’t distort the data
GOOD GRAPHS
• Portray information without distortion
• Contain no distracting elements
– No false third dimensions, irrelevant decoration, or
colour (chartjunk)
• Use an appropriate scale
• Label axes and tick marks properly, including
measurement units
• Have a descriptive title and/or caption and legend
• Have a low ink-to-information ratio
[Figure: two panels, GOOD GRAPH and BAD GRAPH, both titled "Temperature (degC) of Air and Subject during one day"; x-axis: Time of Day (6 am, Noon, 6 pm, Midnight, 6 am); y-axis: Temperature (degC); series: Air and Subject. One panel's temperature axis runs from 15 to 40, the other's from 0 to 40.]
[Figure: two panels, EVEN BETTER GRAPH and BAD GRAPH, both titled "Temperature (degC) of Air and Subject during one day"; x-axis: Time of Day (6 am, Noon, 6 pm, Midnight, 6 am); y-axis: Temperature (degC); series: Air and Subject. One panel's temperature axis runs from 15 to 40, the other's from 0 to 100.]
[Figure: BAD GRAPH, a chart of groups A, B, C, D and E on a vertical axis running from 0 to 18.]
[Figure: GOOD GRAPH, "Boxplot of A, B, C, D, E"; vertical axis: Data (12 to 17).]
[Figure: GOOD GRAPH, "Dotplot of A, B, C, D, E"; horizontal axis: Data (11.5 to 17.0).]
Project: more chewchat data.MPJ; Worksheet: Worksheet 1; 12/04/2006; Graham Errington
GRAPHS THAT CONFUSE
[Figure: three charts, each titled MONTHLY REJECTS, of the same data; x-axis: MONTH (Jan to Dec); y-axis: No. REJECTS; one series per year, 2001 to 2005. The vertical scales run from 0 to 9000, from 0% to 100%, and from 0 to 35000.]
CHART JUNK
[Figure: chart-junk versions of the MONTHLY REJECTS charts; axes: MONTH (Jan to Dec) and No. REJECTS; one series per year, 2001 to 2005; every point carries its own data label. The vertical scales run from 0 to 8000, from 0 to 35000, and from 0% to 100%.]
GRAPHS THAT TELL A STORY
[Figure: I-MR Chart of No. REJECTS by YEAR; x-axis: MONTH within each year, 2001 to 2005. Individual Value chart: UCL = 8406, X-bar = 5926, LCL = 3445. Moving Range chart: UCL = 3048, MR-bar = 933, LCL = 0. Some points are flagged with test numbers 1 and 2.]
Project: Data for ChewChat 13 Apr 2006.MPJ; Worksheet: Worksheet 3; 04/04/2006; Graham Errington
[Figure: Time Series Plot of No. REJECTS; x-axis: MONTH and YEAR (2001 Jan to 2005 Nov); y-axis: No. REJECTS.]
Project: Untitled; Worksheet: Worksheet 3; 04/04/2006; Graham Errington
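The limits printed on the I-MR chart can be roughly reproduced from its two centre lines. The sketch below assumes the usual individuals-chart constants for moving ranges of span 2 (2.66 and 3.267); the function names are illustrative and are not taken from the slides. Because the centre lines on the chart are rounded, the results agree with the printed limits only to within a few units.

```python
from statistics import mean

# Assumed standard constants for moving ranges of span 2:
# 2.66 is 3 / d2 (with d2 = 1.128) and 3.267 is D4.
E2 = 2.66
D4 = 3.267


def moving_ranges(values):
    """Absolute differences between consecutive observations."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


def imr_limits_from_summary(x_bar, mr_bar):
    """Control limits for an I-MR chart, given the two centre lines."""
    return {
        "I_UCL": x_bar + E2 * mr_bar,
        "I_LCL": x_bar - E2 * mr_bar,
        "MR_UCL": D4 * mr_bar,
        "MR_LCL": 0.0,
    }


def imr_limits(values):
    """Compute the centre lines from time-ordered data, then the limits."""
    return imr_limits_from_summary(mean(values), mean(moving_ranges(values)))


# Centre lines as printed on the chart (X-bar = 5926, MR-bar = 933).
slide_limits = imr_limits_from_summary(5926, 933)
print({name: round(value) for name, value in slide_limits.items()})
```

With the raw monthly counts in time order, `imr_limits(counts)` would give the same quantities directly; the order of the data matters because the moving ranges are taken between consecutive months.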
HISTOGRAMS
[Figure: "Histogram", Frequency vs Bin, with bin labels 3505, 4757.8571, 6010.7142 and 7263.5714 and a frequency axis from 0 to 20.]
[Figure: "Histogram of C20"; x-axis: Data (4000 to 8000); y-axis: Frequency (0 to 14).]
Project: DATA FOR CHEWCHAT 13 APR 2006.MPJ; Worksheet: Worksheet 1; 07/04/2006; Graham Errington
• No meaningless gaps
• Reasonable choice of bins
• Easy to choose or adjust bins
• Good aspect ratio
• Meaningful labels on axes
• Appropriate labels on bin tick marks
TRENDING RANDOM VARIATION
“Upward trend”
“Setback”
“Downturn”
“Turnaround”
“Rebound”
“Downward trend”
THE LAW OF AVERAGES
“If I sit in a freezer and
plunge my head into a pan of boiling
chip fat. . . . .
on average, I’m quite comfortable.”
SHEWHART’S RULES FOR
PRESENTATION OF DATA
• Rule One
– Data should always be presented in a way that
preserves the evidence in the data
• Rule Two
– When an average, standard deviation or histogram
is used to summarize data, the user should not be
misled into taking action they would not take if
the data were presented in a time series
USING THE WRONG METHODS
Process:      A      B      C      D
 1        11.85  11.85  11.75  12.14
 2        11.83  11.86  11.95  12.01
 3        11.87  11.87  11.80  11.88
 4        11.84  11.87  11.94  12.07
 5        11.85  11.88  11.95  11.95
 6        11.86  11.89  12.00  11.87
 7        11.85  11.89  12.05  12.06
 8        11.85  11.90  11.85  11.94
 9        11.84  11.92  11.94  11.84
10        11.86  11.91  11.85  12.05
11        12.05  11.93  12.05  11.93
12        12.06  11.93  11.85  11.83
13        12.03  11.95  12.05  12.04
14        12.02  11.97  11.95  11.92
15        12.03  11.96  11.95  11.82
16        12.04  11.99  11.95  12.03
17        12.06  12.00  11.85  11.91
18        12.06  12.00  12.10  11.81
19        12.04  12.16  12.00  12.01
20        12.08  12.25  12.15  11.81

Descriptive Statistics: A, B, C, D
Variable   N    Mean  StDev  CoefVar  Minimum  Maximum
A         20  11.950  0.102     0.85    11.83    12.08
B         20  11.950  0.100     0.84    11.85    12.25
C         20  11.950  0.102     0.86    11.75    12.15
D         20  11.950  0.100     0.84    11.81    12.14
NO SIGNIFICANT DIFFERENCE HERE!
One-way ANOVA: A, B, C, D
Source  DF      SS      MS     F      P
Factor   3  0.0000  0.0000  0.00  1.000
Error   76  0.7746  0.0102
Total   79  0.7746
S = 0.1010   R-Sq = 0.00%   R-Sq(adj) = 0.00%

Individual 95% CIs For Mean Based on Pooled StDev
Level   N    Mean  StDev  --------+---------+---------+---------+
A      20  11.950  0.102  (-----------------*-----------------)
B      20  11.950  0.100  (-----------------*-----------------)
C      20  11.950  0.102  (-----------------*-----------------)
D      20  11.950  0.100  (-----------------*-----------------)
                          --------+---------+---------+---------+
                             11.925    11.950    11.975    12.000
Pooled StDev = 0.101
NO DIFFERENCE?!?
FOUR PROCESSES WITH SAME MEAN AND SD
Mean = 11.95, SD = .10
[Figure: four time series panels, one each for processes A, B, C and D; x-axis: Sample (2 to 20); y-axis: 11.8 to 12.2.]
Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Worksheet 1; 05/04/2006; Graham Errington
ALWAYS CARRY OUT PTBD ANALYSIS
PLOT THE B….. DOTS!
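A minimal sketch of why the dots need plotting, using the four process columns as printed in the data table on the USING THE WRONG METHODS slide. The overall mean and standard deviation come out essentially the same for every process, yet splitting each series into its first and second ten samples (a crude stand-in for a time plot, and an assumption of this sketch rather than anything on the slides) shows very different behaviour over time. The printed data are rounded, so the computed means match the reported 11.950 only approximately.

```python
from statistics import mean, stdev

# Values transcribed from the slide's table, in sample order 1..20.
processes = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}

# Overall summaries: these hide the order of production.
summary = {name: (mean(xs), stdev(xs)) for name, xs in processes.items()}

# Difference between the mean of samples 11-20 and the mean of samples 1-10.
half_shift = {name: mean(xs[10:]) - mean(xs[:10]) for name, xs in processes.items()}

for name in processes:
    m, s = summary[name]
    print(f"{name}: mean={m:.3f} sd={s:.3f} "
          f"second-half minus first-half={half_shift[name]:+.3f}")
```

Process A, for example, shifts upward between samples 10 and 11, which neither the summary table nor the ANOVA can reveal.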
TYPES OF STATISTICAL STUDIES
• Descriptive
• Enumerative
• Analytic
DESCRIPTIVE STUDY
• Count all fish in barrel
• Count number of
goldfish
• Proportion of goldfish
applies to the fish
population in this barrel
and no other barrels of
fish
ENUMERATIVE STUDY
• Take a sample of fish from the
barrel, and count the number
of goldfish in the sample
• Point estimate of the
proportion of goldfish in the
barrel population
• Many statistical procedures do
this
• Cannot make any inference
about any other barrels of fish
ANALYTICAL STUDY
Fish Packing Process over Time
• Will we get the same proportion of goldfish in
the future as we got in the past?
• An analytical study allows prediction within limits
ANALYTICAL STUDY
[Figure: C Chart of No goldfish per Barrel; x-axis: Week No.; y-axis: Sample Count; UCL = 10, C-bar = 4, LCL = 0.]
Project: ENUMERATIVE VS ANALYTICAL STUDIES.MPJ; Worksheet: Fish in Barrel; 05/04/2006; Graham Errington
• Proportion of goldfish is
stable over time
• Fish packing process is
predictable within limits
• We can expect, on
average, 4 goldfish per
barrel, but as many as 10
and as few as 0 in any
single barrel
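The limits on the C chart follow from the average count alone. As a sketch, assuming the standard c-chart formula (centre line plus or minus three times its square root, with the lower limit floored at zero) and a hypothetical helper name:

```python
from math import sqrt


def c_chart_limits(c_bar):
    """Return (LCL, centre line, UCL) for a c chart with average count c_bar."""
    half_width = 3 * sqrt(c_bar)
    lcl = max(0.0, c_bar - half_width)  # a count cannot be negative
    ucl = c_bar + half_width
    return lcl, c_bar, ucl


# Average of 4 goldfish per barrel, as on the chart.
print(c_chart_limits(4))  # (0.0, 4, 10.0)
```

This reproduces the "as many as 10 and as few as 0" range quoted on the slide, which holds only while the process stays stable.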
ENUMERATIVE vs ANALYTICAL
METHODS
• Enumerative methods
– seek to provide numeric summaries, confidence
intervals, etc.
– use significance tests, ANOVA, descriptive stats, etc.
– assume a single, stable population
• Analytical methods
– seek to understand the system under study
– use primarily graphical tools such as run charts,
control charts, histograms, box plots, etc.
– in the real world, most problems are analytical
“Analysis of variance, t-tests, confidence intervals, and
other statistical techniques taught in books,….., are
inappropriate because they provide no basis for
prediction and because they bury the information
contained in the order of production.”
W.E. Deming, Out of the Crisis
Traditional statistical methods have their place,
but are widely abused in the real world. When this
is the case, statistics do more to cloud the issue
than to enlighten.
PARC ANALYSIS
Practical Accumulated Records Compilation
Passive Analysis (by) Regression Correlations
Planning After Research Completed
Profound Analysis Relying (on) Computers
note inverse relationship with
Continuous Recording (of) Administrative Procedures
Constant Repetition (of) Anecdotal Perceptions
PLANNING A PROCESS
IMPROVEMENT STUDY
• Why collect the data?
• What statistical methods for analysis?
• What data will be collected?
• How much data do we need?
• How will the data be measured?
• How good is the measurement system?
• When and where will data be collected?
• Who will collect the data?
• Remember:
GARBAGE IN – GARBAGE OUT
WHAT’S SIGNIFICANT?
Two-sample T for C1 vs C2
     N    Mean  StDev  SE Mean
A    5  13.652  0.487     0.22
B    5  14.369  0.646     0.29
Difference = mu (C1) - mu (C2)
Estimate for difference: -0.716615
95% CI for difference: (-1.551531, 0.118301)
T-Test of difference = 0 (vs not =): T-Value = -1.98  P-Value = 0.083  DF = 8
Both use Pooled StDev = 0.5725
Mean A = 13.7, Mean B = 14.4
Not significant?

Two-sample T for C3 vs C4
       N    Mean  StDev  SE Mean
A    200  13.510  0.501    0.035
B    200  13.667  0.492    0.035
Difference = mu (C3) - mu (C4)
Estimate for difference: -0.157292
95% CI for difference: (-0.254935, -0.059649)
T-Test of difference = 0 (vs not =): T-Value = -3.17  P-Value = 0.002  DF = 398
Both use Pooled StDev = 0.4967
Mean A = 13.5, Mean B = 13.7
Significant?
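The two outputs differ mainly in sample size, which a pooled two-sample t statistic computed from the summary figures makes visible. This is a sketch assuming the equal-variance (pooled) form that the outputs themselves report; the function name is hypothetical, and because the printed means and standard deviations are rounded, the results match the T-Values above only to about two decimals.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled two-sample t statistic and degrees of freedom from summary stats."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Summary statistics as printed in the two outputs.
small = pooled_t(13.652, 0.487, 5, 14.369, 0.646, 5)
large = pooled_t(13.510, 0.501, 200, 13.667, 0.492, 200)

print(f"n=5:   difference={13.652 - 14.369:+.3f}  t={small[0]:.2f}  df={small[1]}")
print(f"n=200: difference={13.510 - 13.667:+.3f}  t={large[0]:.2f}  df={large[1]}")
```

The larger difference (about 0.7) is "not significant" with 5 observations per group, while the smaller one (about 0.16) is "significant" with 200, because the standard error shrinks as the sample size grows. Whether either difference matters in practice is a separate question.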
WHAT SHOULD I DO WITH OUTLIERS?
• Data point far away from the rest of the data
• Don’t remove outliers to make data “look good”
• Do you know why it is different?
• If you do, remove it. If you don’t, leave it in
• Could have a big impact on the analysis
• Re-run analysis without outlier, and compare results
“REGRESSION” WITH EXCEL
• Usually means drawing an X-Y plot, fitting a straight
line and coming up with an R² value.
• As long as R² is high, everything’s hunky-dory.
WRONG!
“REGRESSION” WITH EXCEL
[Figure: scatter plot "Defects vs Cure Time" with a fitted straight line, y = 0.1913x - 5.5192, R² = 0.5079; x-axis: Cure Time s (20 to 50); y-axis: No. of Defects (-2 to 6).]
Relationship is clearly not linear, and should not be
presented as such
“REGRESSION” WITH EXCEL
• Regression model checking – in Excel?
• Residual plots:
– Normally distributed
– Random pattern when plotted vs fitted values
[Figure: three residual plot panels, labelled "OK", "Variance not homogeneous" and "Model incorrect".]
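A small sketch of the "model incorrect" case: a straight line fitted to curved data can still give a high R², while the residuals show an obvious pattern. The data here are hypothetical (a simple quadratic, not the cure-time data behind the slide), and the least-squares fit is written out by hand so that no spreadsheet or extra library is assumed.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(ys, fitted):
    """Proportion of the variation in ys explained by the fitted values."""
    y_bar = mean(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical curved relationship, for illustration only.
xs = [20, 25, 30, 35, 40, 45, 50]
ys = [0.01 * (x - 20) ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
r2 = r_squared(ys, fitted)

print(f"R^2 = {r2:.3f}")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals are positive at both ends of the x range and negative in the middle, a U shape rather than random scatter, even though R² is above 0.9. Plotting residuals against fitted values is what exposes this.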
PITFALLS OF REGRESSION ANALYSIS
• Non-Linear Relationships
• Influential Points
• Extrapolating
• Lurking Variables
• Summary Data
• Assuming Causation
• THAT’S (WITH REASONABLE PROBABILITY) THE
END FOLKS!
And remember,
• With statistics, you never have to say you’re certain!
• THANK YOU FOR YOUR ATTENTION
• ARE THERE ANY QUESTIONS?
• GOOD LUCK!!