Healthcare Operations
Management
An Integrated Approach to
Improving Quality and Efficiency
CHAPTER 7. USING DATA AND STATISTICAL
TOOLS FOR OPERATIONS IMPROVEMENT
Daniel B. McLaughlin
Julie M. Hays
Chapter 7. Using Data and Statistical
Tools for Operations Management
• Data Collection
• Graphical Tools
• Mathematical Descriptions
• Probability and Probability Distributions
• Confidence Intervals, Hypothesis Tests
• ANOVA/MANOVA/MANCOVA
• Regression
Copyright 2008 Health Administration Press. All rights reserved.
Data Collection
• Validity: A valid study has no logic, sampling,
or measurement errors.
- Logic
- Selection or sampling
- Measurement
Data Collection
[Figure: data collection diagram. Diagram created in Inspiration® by Inspiration Software®, Inc.]
Data Collection
Logic
• Why are the data needed?
• What will the data be used for?
• What questions are going to be asked of the
data?
• Are the patterns of the past going to be
repeated in the future?
Data Collection
Selection or Sampling
• Census versus sample
• Nonrandom methods
• Simple random sampling
• Stratified sampling
• Systematic or sequential sampling
• Cluster or area sampling
• Sample size
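The sampling methods listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient IDs, the two strata ("inpatient" and "outpatient"), and the function names are hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Simple random sampling: every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=0):
    """Systematic sampling: every k-th unit after a random start."""
    rng = random.Random(seed)
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start within the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each stratum."""
    rng = random.Random(seed)
    sample = []
    for units in strata.values():
        size = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, size))
    return sample

# Hypothetical population: patient IDs 1..100 split into two assumed strata.
patients = list(range(1, 101))
strata = {"inpatient": patients[:40], "outpatient": patients[40:]}

srs = simple_random_sample(patients, 10)
sys_sample = systematic_sample(patients, 10)
strat = stratified_sample(strata, 0.10)
```

A census would simply use `patients` in full; the sample size (here 10) is the parameter the last bullet refers to.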
Data Collection
Measurement
• Accuracy
- Does the measurement measure what we want it to measure (i.e., say = do)?
• Precision
- How precise should the measurements be?
• Reliability
- Would the measurement be the same if we repeated it?
[Figure: target diagrams labeled "Reliable, but not accurate," "Reliable and accurate," and "Not reliable, but accurate"]
Graphical Tools
• Mapping
• Visual representations of data
• Histograms and Pareto charts
• Stem plots, dot plots
• Box (and whisker) plots
• Normal probability plots
Graphical Tools
Histograms and Pareto Charts
[Figure: histogram of length of hospital stay (x-axis: length of hospital stay in days, binned 1-2 through 17-18; y-axis: frequency) and Pareto chart of diagnosis category (x-axis: diagnosis; y-axis: frequency)]
Microsoft Excel® screen shots reprinted with permission from Microsoft Corporation.
Graphical Tools
Dot Plots
[Figure: dot plot of length of hospital stay; x-axis: days, 3 to 18]
Produced with Minitab® Statistical Software
Graphical Tools
Turnip Graph
Percentage of diabetic Medicare enrollees receiving eye
exams among 306 hospital referral regions (2001)
Source: Wennberg, J. E. 2005. Data from the Dartmouth Atlas Project. Figure copyrighted by the Trustees of
Dartmouth College. Used with permission.
Graphical Tools
Normal Probability Plots
[Figure: normal probability plot of length of hospital stay; axis: observed cumulative probability, 0.00 to 1.00]
Produced with SPSS for Windows
Graphical Tools
Scatter Plots
[Figure: four scatter plots of Y versus X]
Strong Positive Correlation: r = 0.91
Strong Negative Correlation: r = -0.86
Positive Correlation: r = 0.70
No Correlation: r = 0.06
Microsoft Excel® screen shots reprinted with permission from Microsoft Corporation.
Mathematical Descriptions
Mean
• The mean is the arithmetic average of the population:
Population mean $= \mu = \frac{\sum x}{N}$, where $x$ = individual values and $N$ = number of values in the population.
• The population mean can be estimated from a sample:
Sample mean $= \bar{x} = \frac{\sum x}{n}$, where $n$ = number of values in the sample.
For our simple data set, $\bar{x} = \frac{3 + 6 + 8 + 5 + 3}{5} = 5$.
Mathematical Descriptions
Median and Mode
• The median is the middle value of the sample or
population. If the data are arranged into an array
(an ordered data set):
3, 3, 5, 6, 8
5 would be the middle value or median.
• The mode is the most frequently occurring value.
In the above example, the value 3 occurs more
often (two times) than any other value, so 3 would
be the mode.
Mathematical Descriptions
Range and Mean Absolute Deviation
• The range is the difference between the high and low values in a data set.
Range $= x_{\text{high}} - x_{\text{low}} = 8 - 3 = 5$
• The mean absolute deviation (MAD) is the average of the absolute value of the differences from the mean.
$\text{MAD} = \frac{\sum |x - \bar{x}|}{n} = \frac{2 + 2 + 0 + 1 + 3}{5} = \frac{8}{5} = 1.6$
Mathematical Descriptions
Variance, Standard Deviation
• The variance is the average squared difference from the mean.
Population variance $= \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{4 + 4 + 0 + 1 + 9}{5} = \frac{18}{5} = 3.6$
Sample variance $= s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{4 + 4 + 0 + 1 + 9}{5 - 1} = \frac{18}{4} = 4.5$
• The standard deviation is the square root of the variance.
Population standard deviation $= \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{18}{5}} = \sqrt{3.6} \approx 1.9$
Sample standard deviation $= s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{18}{4}} = \sqrt{4.5} \approx 2.1$
Mathematical Descriptions
Coefficient of Variation
The coefficient of variation (CV) is a measure of the relative variation in the data. It is the standard deviation divided by the mean.
$\text{CV} = \frac{\sigma}{\mu}$ or $\frac{s}{\bar{x}}$; for our simple data set, $\frac{1.9}{5} \approx 0.4$
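The descriptive measures in this section (mean, median, mode, range, MAD, variance, standard deviation, and CV) can be checked against the simple data set 3, 6, 8, 5, 3 with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative choices, not part of the slides.

```python
import statistics

data = [3, 6, 8, 5, 3]  # the simple data set used on the slides

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
value_range = max(data) - min(data)

# Mean absolute deviation: average absolute difference from the mean.
mad = sum(abs(x - mean) for x in data) / len(data)

pop_variance = statistics.pvariance(data)   # divides by N
sample_variance = statistics.variance(data) # divides by n - 1
pop_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

# Coefficient of variation using the population standard deviation.
cv = pop_sd / mean

print(mean, median, mode, value_range, mad)
print(pop_variance, sample_variance, round(pop_sd, 1), round(sample_sd, 1), round(cv, 1))
```

Note the distinction between `pvariance`/`pstdev` (population formulas) and `variance`/`stdev` (sample formulas with n − 1 in the denominator).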
Probability and Probability
Distributions
• Determination of probabilities
• Properties of probabilities
• Probability distributions
• Discrete probability distributions
• Continuous probability distributions
Determination of Probabilities
Observed Probability
Observed probability is the relative frequency
of an event—the number of times the event
occurred divided by the total number of trials.
$P(A) = \frac{\text{Number of times A occurred}}{\text{Total number of observations, trials, or experiments}} = \frac{r}{n}$
$P(\text{drug is effective}) = \frac{\text{Number of times patients are cured}}{\text{Total number of patients given the drug}} = \frac{r}{n}$
Determination of Probabilities
Theoretical Probability
Theoretical probability is the theoretical
relative frequency of an event; the theoretical
number of times an event will occur divided by
the total number of possible outcomes.
$P(A) = \frac{\text{Number of times A could occur}}{\text{Total number of possible outcomes}} = \frac{r}{n}$
$P(\text{card is a spade}) = \frac{\text{Number of spades in the deck}}{\text{Total number of cards in the deck}} = \frac{13}{52} = 0.25$
Determination of Probabilities
Opinion Probability
Opinion probability is a subjective
determination of the number of times an event
will occur divided by the imaginary total
number of possible outcomes or trials.
$P(A) = \frac{\text{Opinion of number of times an event will occur}}{\text{Theoretical total}} = \frac{r}{n}$
$P(\text{Secretariat winning the Belmont Stakes}) = \frac{\text{Opinion on the number of times Secretariat would win the Belmont}}{\text{Imaginary total number of times the Belmont would be run}} = \frac{r}{n}$
Properties of Probabilities
Bounds on Probability
• Probabilities always must be ≥ 0, and an event that cannot occur has a probability of 0.
$P(A) = \frac{\text{Least number of times A could occur}}{\text{Total number of possible outcomes}} = \frac{0}{\text{Any number}} = 0$
• Probabilities must always be ≤ 1.
$P(A) = \frac{\text{Greatest number of times A could occur}}{\text{Total number of possible outcomes}} = \frac{n}{n} = 1$
$0 \le P(A) \le 1$
• P(A) + P(A') = 1 and 1 − P(A') = P(A), where A' is not A.
Properties of Probabilities
Multiplicative Property
For two independent events, the probability of both A and B occurring, or the intersection (∩) of A and B, is the probability of A occurring times the probability of B occurring.
P(A and B occurring) = P(A ∩ B) = P(A) × P(B)
Properties of Probabilities
Multiplicative Property
[Figure: probability tree. A coin toss (H or T) followed by a die toss (1 to 6) gives 12 outcomes, each with probability 1/12.]
P(H) = 1/2
P(3) = 1/6
P(H ∩ 3) = P(H) × P(3) = 1/2 × 1/6 = 1/12
Properties of Probabilities
Additive Property
• For two events, the probability of A or B occurring, or the union (∪) of A with B, is the probability of A occurring plus the probability of B occurring, minus the probability of both A and B occurring.
P(A or B occurring) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Properties of Probabilities
Additive Property
[Figure: probability tree. A coin toss (H or T) followed by a die toss (1 to 6) gives 12 outcomes, each with probability 1/12.]
P(H) = 1/2
P(3) = 1/6
P(H ∪ 3) = P(H) + P(3) − P(H ∩ 3) = 1/2 + 1/6 − 1/12 = 7/12
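The coin-and-die example for the multiplicative and additive properties can be verified by enumerating all 12 outcomes. The sketch below uses exact fractions; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (coin, die) outcomes from the probability tree.
outcomes = list(product("HT", range(1, 7)))

def prob(event):
    """Theoretical probability: favorable outcomes over total outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_h = prob(lambda o: o[0] == "H")                     # P(H)
p_3 = prob(lambda o: o[1] == 3)                       # P(3)
p_h_and_3 = prob(lambda o: o[0] == "H" and o[1] == 3) # P(H ∩ 3)
p_h_or_3 = prob(lambda o: o[0] == "H" or o[1] == 3)   # P(H ∪ 3)

print(p_h, p_3, p_h_and_3, p_h_or_3)
```

Because the coin and die are independent, the enumerated intersection equals `p_h * p_3`, and the enumerated union equals `p_h + p_3 - p_h_and_3`.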
Properties of Probabilities
Conditional Probability
The probability of an event occurring if more information is obtained:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Contingency Table for ER Wait Times
                ≤30 minute wait   >30 minute wait   Total
Friday night    20                30                50
Other times     40                10                50
Total           60                40                100
Properties of Probabilities
Conditional Probability
• Note that:
$P(A \cap B) = P(A \mid B) \times P(B) = P(B \mid A) \times P(A)$
and if one event has no effect on the other event (the events are independent), then
$P(A \mid B) = P(A)$ and $P(A \cap B) = P(A) \times P(B)$.
• Bayes' theorem
$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) \times P(A)}{P(B)} = \frac{P(B \mid A) \times P(A)}{P(B \mid A) \times P(A) + P(B \mid A') \times P(A')}$
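As a worked illustration of conditional probability and Bayes' theorem, the sketch below uses the counts from the ER wait-time contingency table. It assumes the first column is "30 minutes or less" and takes A = "wait longer than 30 minutes" and B = "Friday night"; the dictionary keys and the helper `p` are hypothetical names.

```python
from fractions import Fraction

# Counts from the ER wait-time contingency table.
counts = {
    ("friday", "short"): 20,  # Friday night, wait of 30 minutes or less (assumed label)
    ("friday", "long"): 30,   # Friday night, wait over 30 minutes
    ("other", "short"): 40,
    ("other", "long"): 10,
}
total = sum(counts.values())

def p(time=None, wait=None):
    """Probability of all cells matching the given row and/or column."""
    matching = sum(
        c for (t, w), c in counts.items()
        if (time is None or t == time) and (wait is None or w == wait)
    )
    return Fraction(matching, total)

p_long = p(wait="long")                  # P(A)
p_friday = p(time="friday")              # P(B)
p_both = p(time="friday", wait="long")   # P(A ∩ B)

p_long_given_friday = p_both / p_friday  # P(A|B) = P(A ∩ B) / P(B)
# Bayes' theorem: P(B|A) = P(A|B) × P(B) / P(A)
p_friday_given_long = p_long_given_friday * p_friday / p_long

print(p_long_given_friday, p_friday_given_long)
```

Since P(A|B) = 3/5 differs from P(A) = 2/5, wait time and time of arrival are not independent in this table.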
Probability Distributions
Discrete Probability Distributions
The binomial distribution describes the number of times a binary event will occur in a sequence of events.
$P(x) = \frac{n!}{x!(n - x)!} p^x (1 - p)^{n - x}$
[Figure: binomial probabilities for the number of heads in 3 tosses]
The Poisson distribution is used to model the number of events in a specific period.
$P(x) = \frac{e^{-\lambda} \lambda^x}{x!}$
[Figure: Poisson probabilities for the number of patient arrivals in 1 hour]
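The two discrete distributions can be evaluated directly from their formulas. In this sketch the binomial case matches the slide's example (heads in 3 tosses of a fair coin); the Poisson rate of 4 arrivals per hour is an assumed value, since the slide does not state the rate behind its chart.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(x successes in n trials), each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(x events in a period) when the mean number of events is lam."""
    return exp(-lam) * lam**x / factorial(x)

# Number of heads in 3 tosses of a fair coin.
heads = [binomial_pmf(x, 3, 0.5) for x in range(4)]

# Patient arrivals in 1 hour, assuming a mean of 4 arrivals per hour.
arrivals = [poisson_pmf(x, 4.0) for x in range(12)]

print(heads)
print([round(prob, 3) for prob in arrivals])
```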
Probability Distributions
Continuous Probability Distributions
In the uniform distribution, the probability of occurrence is the same for all outcomes.
$P(x) = \frac{1}{b - a}$ for $a \le x \le b$
[Figure: uniform distribution, P(X) versus X between a and b]
The triangular distribution is described by the mode, minimum, and maximum values.
$P(x) = \begin{cases} \frac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x \le c \\ \frac{2(b - x)}{(b - a)(b - c)} & \text{for } c < x \le b \end{cases}$
where a = minimum, c = mode, and b = maximum.
[Figure: triangular distribution, P(X) versus X, with Min = 0.0, Mode = 0.5, Max = 2.0]
Probability Distributions
Exponential Distribution
The exponential distribution is used to
model arrival rate, the rate of occurrence of
an event.
$P(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
$\mu$ = mean = $1/\lambda$, median = $\ln(2)/\lambda$, mode = 0, and $\sigma = 1/\lambda$
[Figure: exponential distribution, P(X) versus X, with lambda = 2]
Probability Distributions
Normal Distribution
$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x - \mu)^2 / 2\sigma^2}$
The normal distribution, $x \sim N(\mu, \sigma^2)$, is commonly observed in the world and provides a reasonable approximation for many randomly distributed variables.
[Figure: normal curves, P(X) versus X from −5 to 5, for (μ = 0, σ = 1.0), (μ = 0, σ = 2.5), and (μ = 2, σ = 0.7)]
Probability Distributions
Standard Normal Distribution
The standard normal distribution, or z distribution, is the normal distribution with μ = 0 and σ = 1.0. Any normal distribution can be transformed to a standard normal distribution by $z = \frac{x - \mu}{\sigma}$.
z-score limits   Proportion within the limits (if normally distributed)
+/− 1 z          0.680
+/− 2 z          0.950
+/− 3 z          0.997
[Figure: standard normal curve (μ = 0, σ = 1.0), P(X) versus X from −5 to 5]
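The proportions in the z-score table can be reproduced from the standard normal cumulative distribution, which the Python standard library exposes through the error function. This is a sketch with illustrative function names; it shows that the table's 0.680 and 0.950 are rounded approximations of roughly 0.683 and 0.954.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def proportion_within(z):
    """Proportion of a normal population within +/- z standard deviations."""
    return standard_normal_cdf(z) - standard_normal_cdf(-z)

def z_score(x, mu, sigma):
    """Transform a value from N(mu, sigma^2) to the standard normal scale."""
    return (x - mu) / sigma

for z in (1, 2, 3):
    print(z, round(proportion_within(z), 3))

# Hypothetical example: a 9-day stay when stays have mean 5 and SD 2.
print(z_score(9, mu=5, sigma=2))
```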
Confidence Intervals, Hypothesis Testing
• Central Limit Theorem
• Hypothesis testing
• Type I (α) and Type II (β) errors
• T-tests
• Proportions
• Practical significance versus statistical significance
Confidence Intervals, Hypothesis Testing
Central Limit Theorem
• As the sample size becomes large, the sampling distribution of the mean approaches normality, no matter what the distribution of the original variable, and
$\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Sampling Distribution Simulation
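A sampling distribution simulation of this kind can be sketched as follows. The choice of an exponential population with rate 2 (mean 0.5, standard deviation 0.5), the sample size of 30, and the 5,000 replications are all assumptions made for illustration; the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)
rate = 2.0                 # exponential population: mean 0.5, sd 0.5 (skewed)
population_mean = 1 / rate
population_sd = 1 / rate

def sample_mean(n):
    """Mean of one random sample of size n from the skewed population."""
    return statistics.fmean(rng.expovariate(rate) for _ in range(n))

n = 30
means = [sample_mean(n) for _ in range(5000)]

mean_of_means = statistics.fmean(means)        # should be close to mu
sd_of_means = statistics.stdev(means)          # should be close to sigma / sqrt(n)
expected_sd = population_sd / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is strongly skewed.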
Confidence Intervals
Confidence interval for the true value of the population mean:
$\bar{x} - z_{\alpha/2} \sigma_{\bar{x}} \le \mu \le \bar{x} + z_{\alpha/2} \sigma_{\bar{x}}$
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
[Figure: standard normal curve, P(X) versus Z from −3 to 3, with the central 95% marked and 2.5% in each tail]
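The confidence interval for the mean can be computed with the standard library's `NormalDist`. The inputs below (a sample mean of 5.0, a known σ of 1.9, and n = 25) are hypothetical values chosen for illustration, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Interval xbar +/- z_(alpha/2) * sigma / sqrt(n) for a known sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical inputs: sample mean 5.0 days, sigma 1.9 days, 25 patients.
low, high = z_confidence_interval(xbar=5.0, sigma=1.9, n=25)
print(round(low, 3), round(high, 3))
```

Raising `n` while holding everything else fixed shrinks the half-width, because the standard error σ/√n decreases.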
Hypothesis Testing
• Belief or null hypothesis, Ho: $\mu = b$
• Alternate belief or hypothesis, Ha: $\mu \ne b$
• Decision rule: If $|z| > z^*$, reject the null hypothesis, where $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$.
[Figure: standard normal curve, P(X) versus Z from −3 to 3, showing the region −Z* < Z < Z* (95% confidence) and the rejection regions Z < −Z* and Z > Z*]
Hypothesis Testing
Type I (α) and Type II (β) Errors
Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \ne \mu_2$
Type I and Type II Error: Clinic Wait Time Example
                                                           Reality: wait times at the two     Reality: wait times at the two
                                                           clinics are the same (μ1 = μ2)     clinics are NOT the same (μ1 ≠ μ2)
Assessment or guess: wait times are the same (μ1 = μ2)     No error                           Type II or β error
Assessment or guess: wait times are NOT the same (μ1 ≠ μ2) Type I or α error                  No error
Equal Variance t-Test
• t-tests are used to test hypotheses about
two means.
• Ho: $\mu_1 = \mu_2$
Ha: $\mu_1 \ne \mu_2$
• Decision rule: If $|t| > t^*$, reject Ho
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
• Confidence interval
$(\bar{x}_1 - \bar{x}_2) - t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
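The equal variance t statistic can be computed step by step from the formulas above. The wait times for the two clinics below are made-up numbers for illustration, and the function name is an assumption. The standard library has no t distribution, so the critical value t* is taken from a t table (2.306 for a two-sided 95% test with 8 degrees of freedom).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2, hypothesized_diff=0.0):
    """Equal variance t statistic and degrees of freedom for two samples."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)  # sample variances
    df = n1 + n2 - 2
    s_p = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df)  # pooled sd
    standard_error = s_p * sqrt(1 / n1 + 1 / n2)
    t = ((mean(sample1) - mean(sample2)) - hypothesized_diff) / standard_error
    return t, df

# Hypothetical wait times (minutes) at two clinics.
clinic_a = [20, 25, 30, 35, 40]
clinic_b = [30, 35, 40, 45, 50]

t, df = pooled_t(clinic_a, clinic_b)
t_star = 2.306  # two-sided 95% critical value for 8 df, from a t table
reject_null = abs(t) > t_star
print(round(t, 3), df, reject_null)
```

With these numbers |t| = 2.0 is below t*, so the null hypothesis of equal mean wait times would not be rejected at the 95% level.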
Proportions
Ho: $\pi_1 = \pi_2$
Ha: $\pi_1 \ne \pi_2$
Decision rule: If $|z| > z^*$, reject Ho
$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}}}$, where $p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
Confidence interval
$(p_1 - p_2) - z^* \sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}} \le \pi_1 - \pi_2 \le (p_1 - p_2) + z^* \sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}}$
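The test for two proportions follows the same pattern as the z test for a mean. The sketch below uses hypothetical counts (45 of 100 patients versus 30 of 100 patients) and an assumed function name; z* is obtained from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, hypothesized_diff=0.0):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)          # pooled proportion
    standard_error = sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    return ((p1 - p2) - hypothesized_diff) / standard_error

# Hypothetical data: 45 of 100 patients satisfied at clinic 1, 30 of 100 at clinic 2.
z = two_proportion_z(45, 100, 30, 100)
z_star = NormalDist().inv_cdf(0.975)             # two-sided 95% critical value
reject_null = abs(z) > z_star
print(round(z, 2), round(z_star, 2), reject_null)
```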
Practical Significance Versus
Statistical Significance
• Basic confidence interval
statistic − [(z*) × (s.e. statistic)] ≤ parameter ≤ statistic + [(z*) × (s.e. statistic)]
• As n increases, s.e. decreases and the confidence interval gets narrower.
• Large samples may give statistically significant results that are not practically significant.
ANOVA/MANOVA/MANCOVA
• One-way ANalysis Of VAriance (ANOVA) is used
to test hypotheses about three or more levels of
treatment. A t-test will give the same information
as an ANOVA when there are only two treatment
levels of interest.
• Two-way and higher ANOVAs are used when
there is more than one type of treatment variable
of interest.
• MANOVA/MANCOVA are used when there is more
than one outcome or dependent variable of
interest.
Regression
• Simple linear regression—used to describe
the relationship between two variables
• Multiple regression—used to describe the
relationship between multiple predictor
variables and a single dependent variable
• General linear model
• Artificial neural networks
• Design of experiments
What Is the Equation of a Line?
Algebra: $y = mx + b$
Statistics: $\hat{Y} = bX + a$
Where
$b$ = slope = $\frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}$
$a$ = y intercept = y, when x = 0
Problem
Student A owns a health insurance firm and
wants us to determine the cost (price would
be a more difficult problem) of providing
healthcare to insured individuals.
Seeing the Future
Data
Experiences
are relevant
Judgment: To what degree are
these experiences still
relevant?
Experiences
are irrelevant
Deductive reasoning versus inductive reasoning
What Is the Cost of Healthcare
Related To?
Quantitative
______________
______________
______________
______________
______________
______________
Qualitative
_____________
_____________
_____________
_____________
_____________
_____________
Selection
• Define population
• Census or sample
• Type of sample
• Measurement: accurate, reliable, precise?
X = number of dependents; Y = annual
healthcare expense ($1,000)
• Is the study valid?
• How do we create knowledge from data?
Data
Number of Dependents   Annual Healthcare Expense ($1,000)
0                      3
1                      2
2                      6
3                      7
4                      7
Scatterplot
[Figure: scatterplot of Y (annual healthcare cost, $1,000) versus X (number of dependents) with candidate lines y = 1.3x + 2.4, y = x + 3, y = 1.2x + 2, and y = 5]
Scatterplot Questions
• Which is the “best” line on the scatterplot?
• How would you define “best” (e.g., must be
quantifiable)?
Professor’s Model
$\hat{Y} = bX + a$
$\hat{Y}$ = cost estimate ($1,000)
$a$ = Y intercept = 3
$b$ = slope = $\frac{\Delta Y}{\Delta X}$ = 1
$\hat{Y} = 1X + 3 \rightarrow$ knowledge
Model Comparison
Models: Prof’s $\hat{Y} = X + 3$; Student 1 $\hat{Y} = 1.2X + 2$; Student 2 $\hat{Y} = 1.3X + 2.4$. In every column, $e = Y - \hat{Y}$.
X         Y    Prof’s Yhat = X + 3   Prof’s e   Student 1 e   Student 2 e
0         3    3                     0          1             0.6
1         2    4                     −2         −1.2          −1.7
2         6    5                     1          1.6           1
3         7    6                     1          1.4           0.7
4         7    7                     0          0.2           −0.6
Σ (sum)                              0          3             0
Good Model
• A good model must be unbiased: $\sum e = 0$
• Is that enough? What else? Does this remind you of $\sigma^2$?
• How do we get rid of signs?
Model Comparison
X         Y    Yhat = X + 3   e = Y − Yhat   e²   Student 1 e²
0         3    3              0              0    1
1         2    4              −2             4    1.44
2         6    5              1              1    2.56
3         7    6              1              1    1.96
4         7    7              0              0    0.04
Σ (sum)   25   25             0              6    7
Least Squares Technique
Gauss proved that if you use:
$b = \frac{\sum (Y - \bar{Y})(X - \bar{X})}{\sum (X - \bar{X})^2}$ and $a = \bar{Y} - b\bar{X}$
You are guaranteed that $\sum e = 0$ and $\sum e^2$ is a minimum.
Yhat = 1.3X + 2.4, $\sum e = 0$, and $\sum e^2 = 5.1$.
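The least squares formulas can be applied to the five data points (number of dependents versus annual healthcare expense) to confirm the slide's result. This is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice.

```python
def least_squares(xs, ys):
    """Return slope b and intercept a of the least squares line Yhat = bX + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return b, a

xs = [0, 1, 2, 3, 4]   # number of dependents
ys = [3, 2, 6, 7, 7]   # annual healthcare expense ($1,000)

b, a = least_squares(xs, ys)
errors = [y - (b * x + a) for x, y in zip(xs, ys)]
sum_e = sum(errors)
sum_e_squared = sum(e ** 2 for e in errors)

print(round(b, 2), round(a, 2), round(sum_e, 6), round(sum_e_squared, 2))
```

The errors sum to zero (up to floating-point rounding), and the sum of squared errors, 5.1, is smaller than the 6 and 7 obtained for the professor's and Student 1's lines.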
Coefficient of Determination
Are we better off making estimates by using
information (X = number of dependents) and
having created knowledge (Yhat = 1.3X +
2.4) than using no information or knowledge
(i.e., is the model “better”)?
How would you estimate without using our
knowledge (our model)?
Sum of Squares Total
X         Y    Yhat = Ybar   e = Y − Ybar   SSTO: (Y − Ybar)²
0         3    5             −2             4
1         2    5             −3             9
2         6    5             1              1
3         7    5             2              4
4         7    5             2              4
Σ (sum)   25   25            0              22
Note that this method is unbiased.
Graph
[Figure: scatterplot of Y (annual healthcare cost, $1,000) versus X (number of dependents) with the horizontal line y = 5]
Errors
[Figure: errors plotted on the scatterplot of Y (annual healthcare costs, $1,000) versus X (number of dependents)]
Sum of Squares Error
X         Y    Yhat = 1.3X + 2.4   e = Y − Yhat   SSE: e² = (Y − Yhat)²   Ybar   Y − Ybar   SSTO: (Y − Ybar)²
0         3    2.4                 0.6            0.36                    5      −2         4
1         2    3.7                 −1.7           2.89                    5      −3         9
2         6    5                   1.0            1.00                    5      1          1
3         7    6.3                 0.7            0.49                    5      2          4
4         7    7.6                 −0.6           0.36                    5      2          4
Σ (sum)   25   25                  0              5.1                     25     0          22
Coefficient of Determination
What is the percentage of improvement when we use knowledge gained from our model?
$\%\ \text{improvement} = \frac{\text{Old error level} - \text{new error level}}{\text{Old error level}} \times 100 = \frac{22 - 5.1}{22} \times 100 = \frac{16.9}{22} \times 100 = 77\%$
r² = coefficient of determination = 77%
r² = 0.77
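The coefficient of determination can be checked by computing the three sums of squares for the model Yhat = 1.3X + 2.4. The sketch below is self-contained and uses illustrative variable names; it also confirms that SSR + SSE = SSTO for the least squares line.

```python
xs = [0, 1, 2, 3, 4]   # number of dependents
ys = [3, 2, 6, 7, 7]   # annual healthcare expense ($1,000)

b, a = 1.3, 2.4        # least squares slope and intercept from the slides
y_bar = sum(ys) / len(ys)
y_hat = [b * x + a for x in xs]

ssto = sum((y - y_bar) ** 2 for y in ys)               # old error level (no model)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # new error level (model)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the model

r_squared = (ssto - sse) / ssto
r = r_squared ** 0.5   # positive root because the slope is positive

print(round(ssto, 1), round(sse, 1), round(ssr, 1), round(r_squared, 2), round(r, 4))
```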
Another Viewpoint
Variation in healthcare cost is either explained
by knowledge (the model) or not explained.
Explained and Unexplained Error
[Figure: scatterplot of Y (annual healthcare costs, $1,000) versus X (number of dependents); dashed segments mark explained error and solid segments mark unexplained error]
Sum of Squares Regression
X         Y    Yhat = 1.3X + 2.4   e = Y − Yhat   SSE: e² = (Y − Yhat)²   Ybar   Y − Ybar   SSTO: (Y − Ybar)²   Yhat − Ybar   SSR: (Yhat − Ybar)²
0         3    2.4                 0.6            0.36                    5      −2         4                   −2.6          6.76
1         2    3.7                 −1.7           2.89                    5      −3         9                   −1.3          1.69
2         6    5                   1.0            1.00                    5      1          1                   0             0
3         7    6.3                 0.7            0.49                    5      2          4                   1.3           1.69
4         7    7.6                 −0.6           0.36                    5      2          4                   2.6           6.76
Σ (sum)   25   25                  0              5.1                     25     0          22                  0             16.9
Coefficient of Determination
$r^2 = \frac{\text{Explained}}{\text{Total}} = \frac{SSR}{SSTO} = \frac{16.9}{22.0} = 0.77$
Note: r² is not based on statistics or probability; it is just a percentage.
Correlation Coefficient
$r = \pm\sqrt{r^2}$
r = Correlation coefficient
= Measure of the strength of the linear relationship between two variables
$-1 \le r \le 1$
[Figure: scatter plots illustrating r = −1 and r = +1]
Correlation Coefficient Examples
r = 0.0
r = 0.9
r = −0.5
Coefficient of Determination
Questions:
• If r² is low, does that mean there is no relationship between your variables?
• If r² is high (close to 1), does that mean you always get useful predictions from your model?
• If r² is high, does that mean your model has a “good” fit?
r² and Curves
• Can we fit a straight line to this?
• Yes, and we are guaranteed that the errors sum to zero and the sum of squared errors is a minimum.
• However, a curve would be better.
[Figure: scatterplot of Y versus X showing a curved relationship]
Excel Output
To get this sheet, go to Tools -> Data Analysis -> Regression. If you don't have Data Analysis
listed in your tools, see Excel help "Install and Use the Analysis ToolPak.”
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.8765
R Square            0.7682
Adjusted R Square   0.6909
Standard Error      0.8790
Observations        5
ANOVA
             df   SS       MS       F        Significance F
Regression   1    7.6818   7.6818   9.9412   0.0511
Residual     3    2.3182   0.7727
Total        4    10
                                      Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept                             -0.9545        1.0162           -0.9393   0.4169    -4.1885     2.2794      -3.3460       1.4369
Y: $1,000 Annual Healthcare Expense   0.5909         0.1874           3.1530    0.0511    -0.0055     1.1873      0.1499        1.0320
RESIDUAL OUTPUT
Observation   Predicted X: Number of Dependents   Residuals   Standard Residuals
1             0.8182                              -0.8182     -1.0747
2             0.2273                              0.7727      1.0150
3             2.5909                              -0.5909     -0.7762
4             3.1818                              -0.1818     -0.2388
5             3.1818                              0.8182      1.0747
PROBABILITY OUTPUT
Percentile   X: Number of Dependents
10           0
30           1
50           2
70           3
90           4
[Figures: residual plot, line fit plot, and normal probability plot]
F Test
$F^* = \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)}$
If $F^* > F(1 - \alpha; 1; n - 2)$, reject $H_0$: $\beta = 0$ (in this case)
$MSR/MSE \approx 1 \Rightarrow \beta = 0$
$MSR/MSE$ big $\Rightarrow \beta \ne 0$
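As a worked instance of the F test, the sketch below plugs in the sums of squares from the healthcare cost example (SSR = 16.9, SSE = 5.1, n = 5). The critical value F(0.95; 1; 3) ≈ 10.13 is taken from an F table because the Python standard library has no F distribution; the function name is an illustrative choice.

```python
def f_statistic(ssr, sse, n):
    """F* = MSR / MSE for simple linear regression with one predictor."""
    msr = ssr / 1          # regression mean square, 1 degree of freedom
    mse = sse / (n - 2)    # error mean square, n - 2 degrees of freedom
    return msr / mse

ssr, sse, n = 16.9, 5.1, 5
f_star = f_statistic(ssr, sse, n)

f_critical = 10.13         # F(0.95; 1; 3) from an F table
reject_null = f_star > f_critical
print(round(f_star, 4), reject_null)
```

F* is about 9.94, just below the critical value, so the hypothesis that the slope is zero would not be rejected at α = 0.05 with only five observations, even though r² is 0.77.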
Assumptions of Linear
Regression
Linear regression is based on several
assumptions. If these assumptions are
violated, the resulting model will be
misleading. The principal assumptions are:
- The dependent and independent variables are
linearly related.
- The errors associated with the model are not
serially correlated.
- The errors are normally distributed and have
constant variance.
Transformations
If the variables are not linearly related or the assumptions of regression are violated, the variables can be transformed to produce a possibly better model.
X    Y   Transform X -> X²
−3   9   9
−2   4   4
−1   1   1
0    0   0
1    1   1
2    4   4
3    9   9
[Figure: plot of Y versus X²]
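The effect of the transformation can be seen by fitting a straight line to the table's data before and after replacing X with X². This is a sketch in plain Python with an illustrative helper, `fit_line`, that returns the slope, intercept, and r².

```python
def fit_line(xs, ys):
    """Least squares slope, intercept, and r squared for Yhat = bX + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ssto = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return b, a, 1 - sse / ssto

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [9, 4, 1, 0, 1, 4, 9]

b_raw, a_raw, r2_raw = fit_line(xs, ys)              # straight line on X
x_squared = [x ** 2 for x in xs]
b_tr, a_tr, r2_tr = fit_line(x_squared, ys)          # straight line on X²

print(b_raw, a_raw, r2_raw)
print(b_tr, a_tr, r2_tr)
```

On the raw X the best line is flat at Y = 4 with r² = 0, even though Y is completely determined by X; after the transformation the fit is exact.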
Multiple Regression
• Multiple independent variables are used to
predict a single dependent variable to
“improve” the model.
• $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
• Multicollinearity can be a problem.
General Linear Model
• The most general of all linear models
• Multiple predictor variables:
- Metric
- Categorical
- Both
• Multiple dependent variables:
- Metric
- Categorical
- Both
• Can be used to build complex models
Artificial Neural Networks
Neural Networks
• Large amounts of data
• No explanation of
how/why
• Used to predict
outcomes
Traditional Models
• Limited amount of data
• Model explains
how/why
• Used to predict
outcomes
Outline for Analyses
1. Define the problem/question.
2. Determine what data will be needed to address
the problem/question.
3. Collect the data.
4. Graph the data.
5. Analyze the data using the appropriate tool.
6. “Fix” the problem.
7. Evaluate the effectiveness of the “fix.”
8. Start again.
Choice of Statistical Technique
Independent
Variable
Categorical
Dependent
Variable
One
Categorical
Metric
Many
Categorical
Metric
Mathematical
Graphical
One
2
Many
2 (layered)
One
t-Test
Histogram
type
Many
MANOVA
Box plot
One
2
Many
2 (layered)
One
ANOVA
Many
MANOVA
Both
Box plots
GLM
Choice of Statistical Technique
Independent
Variable
Metric
Dependent
Variable
One
Categorical
Mathematical
One
Graphical
Logit
Many GLM
Metric
One
Simple regression
Scatterplot
Many GLM
Both
Many Categorical
MANCOVA
One
Logit
Many GLM
Metric
One
Multiple regression
Many GLM
Both
GLM; neural net
Choice of Statistical Technique
Independent
Variable
Dependent
Variable
Both
Categorical
Metric
Mathematical
One
ANCOVA
Many
MANCOVA
One
Simple regression
Many
Multiple regression
Both
Graphical
GLM
Neural Net