Module Four: Normal Distribution and Diagnosis of Normality

Module Four: Normal distribution and its applications to inter-laboratory testing
When we conduct inter-laboratory testing, we often observe continuous variables, e.g., the amount of chloride in a water sample, the beta-carotene in a blood sample, and blood pressure are continuous variables.
When we construct a relative frequency histogram, it is very likely that the shape of the distribution is bell-shaped, that is, a few values are small, a few are large, and most of them are around the average.
This type of distribution is what we call a NORMAL distribution.
For example, blood pressure, the beta-carotene in a blood sample, and the amount of chloride in a water sample mostly follow normal curves.
A histogram with an imposed normal curve for 1900 individuals’ systolic blood pressure
[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20; m = 115, s = 14. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]
The imposed smooth curve looks like a bell shape. If the blood pressure follows a normal curve with mean 115 and s.d. 14, we use the notation: X ~ N(m, s). For this case, X ~ N(115, 14).
An immediate question is: How can we detect if the distribution indeed follows a normal curve?
Our interest may be to check if the blood pressure follows a normal distribution, to find out what proportion of individuals have blood pressure at risk (150 mmHg or higher), or to identify extreme cases.
When and How do you use Normal Distribution in real
world situations?
The normal curve describes the probability of occurrence in many real situations.
• Most statistical techniques, including the techniques used for analyzing inter-laboratory testing data, assume that the response variable approximately follows a normal curve.
• These methods may not be valid if the response does not follow a normal distribution. It is, therefore, important to learn how to check whether a response variable follows a normal distribution. For this reason, we need to learn some basic properties of a normal distribution, and how to compute probabilities and percentiles for a normal distribution.
• In this module, we will discuss:
  • The use of the z-table and Minitab to compute probabilities and percentiles.
  • Techniques for checking if a response variable follows a normal distribution.
The normal probability distribution provides a good model for describing data that have mound-shaped frequency distributions.
The Normal Probability Distribution:

\[ f(x) = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{(x-m)^2}{2s^2}} \]

where e ≈ 2.718 and π ≈ 3.142; m and s (s > 0) are the parameters that represent the population mean and standard deviation.
We will use the notation: X ~ N(m , s). This means
X is distributed as Normal with mean m and standard deviation s.
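As a quick check of the density formula, here is a minimal Python sketch (Python is not part of this module, which uses Minitab). The function name `normal_pdf` is made up for illustration. It evaluates f(x) directly from the formula and compares it with the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def normal_pdf(x, m, s):
    """Density f(x) of N(m, s), written directly from the formula above."""
    if s <= 0:
        raise ValueError("s must be positive")
    return math.exp(-((x - m) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# Blood pressure example from the text: X ~ N(115, 14)
peak = normal_pdf(115, 115, 14)                        # height of the curve at the mean
library_value = NormalDist(mu=115, sigma=14).pdf(115)  # same quantity from the library
print(round(peak, 5), round(library_value, 5))
```

The two printed values should agree, since both evaluate the same formula.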
Some examples of normal random variables are:
X = Adult height,
X = Scores on a national test,
X = Gas price, X = Blood pressure
NOTE: X = salary of individuals who are 40 years or older, before retirement, does not follow a normal curve. It is a skewed-to-right distribution.
Properties of Normal Distribution

This figure shows three such distributions with differing values of m and s .
[Figure: three normal curves with means m1, m2, m3 and standard deviations s1, s2, s3.]

Mean determines the center. In this case, m1 < m2 < m3
Standard deviation measures the variability. In this case, s2 < s1 < s3
Large values of s reduce the height of the curve and increase the spread.
Small values of s increase the height of the curve and reduce the spread.
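Some properties for X ~ N(m, s)
[Figure: normal density f(x) against X, with the points m-a, m, and m+a marked on the X-axis.]
The curve is symmetric about the mean m, so for any a > 0:
P(m-a < X < m) = P(m < X < m+a)
P(X < m-a) = P(X > m+a)
Also:
P(X > m) = P(X < m) = .5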
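The symmetry properties can be checked numerically. The sketch below uses Python's `statistics.NormalDist` with the blood pressure example X ~ N(115, 14). The value a = 20 is an arbitrary choice for illustration and does not come from the module.

```python
from statistics import NormalDist

X = NormalDist(mu=115, sigma=14)   # blood pressure example, X ~ N(115, 14)
m, a = 115, 20                     # a = 20 is an arbitrary illustrative value

left_of_mean = X.cdf(m) - X.cdf(m - a)     # P(m-a < X < m)
right_of_mean = X.cdf(m + a) - X.cdf(m)    # P(m < X < m+a)
lower_tail = X.cdf(m - a)                  # P(X < m-a)
upper_tail = 1 - X.cdf(m + a)              # P(X > m+a)
below_mean = X.cdf(m)                      # P(X < m)

print(left_of_mean, right_of_mean)
print(lower_tail, upper_tail)
print(below_mean)
```

Each printed pair should match up to rounding error, and the last value should be .5.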
Example:
Every year, universities recruit students using their SAT scores. Based on previous information, we know that SAT scores follow a normal curve with mean 1000 and standard deviation 180. In the past, CMU has admitted students with SAT scores of 1090 or higher.
Q1: What percent of high school students can receive CMU admission?
Q2: Suppose CMU decides to raise the SAT admission limit to admit only the top 20% of high school graduates. What should the new SAT admission limit be?
Q3: A student scored 1200 and claims he is in the top 10%. Is this a correct claim?
Tabulated Areas of the Normal Probability
Distributions
• How do you solve the SAT admission problem?
First, we need to rewrite the problem using the notation we are familiar with.
Let X = SAT score. Then from the given information, we know:
X ~ N(1000, 180).
Q1: asks for P(X > 1090)
Q2: asks for a value of X, call it xo, the admission limit, so that P(X > xo) = .2
Q3: asks for comparing P(X > 1200) with .1
How do we solve these problems?
• The probability that a continuous random variable X assumes a value in the interval from a to b is the area under the probability density function between the points a and b.
One can use software such as Minitab, or use a standardized Z-table.
The Standard Normal Random Variable:
The standardized normal random variable z is defined as
z = (x - m)/s, or equivalently, x = m + zs.

The standard normal probability distribution has a mean of zero and a standard deviation of 1, that is, Z ~ N(0,1).
The area under the standard normal curve between the mean z = 0 and a specified positive value of z, say zo, is the probability P(0 ≤ z ≤ zo).
Some books use this table. Some use other types of tables.
[Figure: standard normal curve with the area between 0 and zo shaded; axes Z and X.]
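Back to the SAT score problem:
X ~ N(1000, 180), and we want P(X > 1090).
[Figure: normal curve for X, SAT score, with the area to the right of 1090 shaded. On the Z scale, Z = (x-1000)/180, so x = 1000 corresponds to z = (1000-1000)/180 = 0 and x = 1090 corresponds to z = (1090-1000)/180 = 0.5.]
The idea is to transform X ~ N(m, s) to Z ~ N(0,1) using z = (x-m)/s:
P(X > 1090) = P(Z > (1090-1000)/180) = P(Z > 0.5)
Now the Z-table can be applied: P(0 < Z < 0.5) = .1915, so P(X > 1090) = .5 - .1915 = .3085, or about 31%.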
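The standardization step can also be sketched in Python (an illustration only; the module itself uses the Z-table or Minitab). The variable names below are made up for this example. It computes P(X > 1090) two ways: through the standardized value z, and directly on the X scale.

```python
from statistics import NormalDist

m, s = 1000, 180                 # SAT example: X ~ N(1000, 180)
cutoff = 1090

z = (cutoff - m) / s                             # standardized cutoff
p_via_z = 1 - NormalDist().cdf(z)                # P(Z > z) using N(0, 1)
p_direct = 1 - NormalDist(m, s).cdf(cutoff)      # P(X > 1090) without standardizing

print(z, round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same probability, which is why a single table for N(0,1) is enough.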
Example
Find P(0 < z < 1.63)
Solution
1. Draw a normal curve, shade the area of interest.
2. Rewrite the question in a form to which the Z-table can be applied, that is, in the form P(0 < Z < zo).
For this example, it is already in this form, so using the Z-table, we obtain: P(0 < z < 1.63) = .4484.
Some additional exercises:
Find P( Z < 1.96), Find P(-1.24< Z < .68), Find P( Z > -1.64)
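To check table lookups such as these, the sketch below defines a small helper, `table_area` (a made-up name), that returns the tabulated quantity P(0 < Z < zo). It then rewrites each exercise in terms of that quantity, as one would by hand. It is a Python illustration and assumes the table in use gives areas between 0 and zo.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

def table_area(z0):
    """P(0 < Z < z0) for z0 >= 0, the area this kind of Z-table lists."""
    if z0 < 0:
        raise ValueError("use symmetry to rewrite the problem with z0 >= 0")
    return Z.cdf(z0) - 0.5

example = table_area(1.63)                   # P(0 < Z < 1.63)
p1 = 0.5 + table_area(1.96)                  # P(Z < 1.96)
p2 = table_area(1.24) + table_area(0.68)     # P(-1.24 < Z < .68), by symmetry
p3 = 0.5 + table_area(1.64)                  # P(Z > -1.64) = P(Z < 1.64)

print(round(example, 4), round(p1, 4), round(p2, 4), round(p3, 4))
```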
Calculating Probabilities for a General Normal Random Variable, X:
1. Draw a normal curve for X, shade the area of interest.
2. Transform X to Z.
- Standardize the interval of interest, writing it as the equivalent interval in terms of z.
- The probability of interest is the area that you find using the standard normal probability distribution.
Now, back to the SAT example, do the following exercises:
SAT score, X, follows a normal distribution with mean 1000 and s.d. 180. That is, X ~ N(1000, 180)
Find P(X < 800)
Find P(750 < X < 900)
Find P(1180 < X < 1360)
How about the question of determining the SAT admission score for CMU so that the top 20% will receive admission from CMU?
Answer: X ~ N(1000, 180). The problem is to find the admission score, xo, so that
P(X > xo) = .2
In this problem we are looking for a score, not a probability. We are reversing the problem-solving procedure here.
A similar technique is applied:
1. Draw a normal curve, shade the area of interest.
2. Transform from X to Z.
3. Rewrite the problem in terms of Z.
4. Solve for the standardized value, zo, using the Z-table in reverse.
5. Transform zo back to xo by xo = m + s(zo)
To solve for the admission score xo so that P(X > xo) = .2:
Draw the normal curve, shade the area of interest, transform to Z.
.2 = P(X > xo) = P(Z > zo) implies P(0 < Z < zo) = .5 - .2 = .3
This is a form for which we can use the Z-table.
Looking inside the table, find the probability closest to .3, which is .2995.
By the Z-table, .2995 = P(0 < Z < .84).
Therefore, zo = .84, which is the standardized admission limit.
So, solving for xo, we have xo = m + s(zo) = 1000 + (180)(.84) = 1151.2
The CMU SAT admission limit will be about 1151.2
(In actual application for setting the policy, we can use 1150 as the new admission standard.)
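The reverse lookup can be sketched in Python with `NormalDist.inv_cdf`, which plays the role of reading the Z-table in reverse. This is an illustration with made-up variable names. Because the software does not round zo to two decimals, its answer differs slightly from the table-based 1151.2.

```python
from statistics import NormalDist

m, s = 1000, 180          # SAT example: X ~ N(1000, 180)
top_fraction = 0.20       # admit the top 20%

# P(X > xo) = .2 is the same as P(X < xo) = .8, so xo is the 80th percentile.
z0 = NormalDist().inv_cdf(1 - top_fraction)   # standardized admission limit
x0 = m + s * z0                               # transform back to the SAT scale

print(round(z0, 4), round(x0, 1))
```

The same function gives other percentiles by changing the cumulative probability passed to `inv_cdf`.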
Hands-on activities:
Q-a: For the SAT example, X ~ N(1000, 180), suppose a university admits only the top 5%. Find their admission limit.
Q-b: Find the 5th percentile of SAT score.
Q-c: Find the Q3 SAT score (75th percentile).
Use Minitab to compute cumulative probabilities and percentiles for a normal distribution
1. Go to Calc, choose Probability Distributions, then select Normal.
2. In the dialog box, Probability density = f(x), Cumulative probability = P(X < a) for any given a, and Inverse cumulative probability is the 100p-th percentile, xo, so that P(X < xo) = p. Choose the one you are computing.
3. Enter the Mean and s.d. By default, it is N(0,1).
4. To compute a cumulative probability, you need to provide the ‘a’ values, which may be created and recorded in a column, e.g., C3, or simply provide the constant ‘a’.
5. To compute an inverse cumulative probability, you need to provide the cumulative probabilities, which must be in (0,1).
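For readers without Minitab, the three dialog choices have rough analogues in Python's standard library. The sketch below is not Minitab output. It uses `statistics.NormalDist` and the SAT example, with a small list standing in for a worksheet column (the values in it are chosen only for illustration).

```python
from statistics import NormalDist

X = NormalDist(mu=1000, sigma=180)    # enter Mean and s.d.; NormalDist() alone is N(0, 1)

density = X.pdf(1090)                 # probability density f(1090)
cumulative = X.cdf(1090)              # cumulative probability P(X < 1090)
percentile = X.inv_cdf(cumulative)    # inverse cumulative probability: recovers 1090

# A "column" of a values, processed one by one (illustrative values)
a_values = [800, 900, 1000]
cumulative_column = [X.cdf(a) for a in a_values]

print(density, cumulative, percentile)
print(cumulative_column)
```

As in Minitab, `inv_cdf` requires its input to lie strictly between 0 and 1.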
Methods for detecting the discrepancy of the distribution of a response variable from a normal distribution.
Consider the example of the Blood Pressure data. From the histogram and the normal curve imposed onto the histogram using Minitab, we can see that the blood pressure, generally speaking, follows a normal curve. However, there seem to be a few unusually high blood pressures. The question is ‘How well does the blood pressure follow a normal curve?’
The imposed normal curve helps us to quickly identify a serious discrepancy from normal. However, if the discrepancy is not very serious, it is difficult to detect by simply observing the shape of a histogram.
We will discuss three ways of checking the normality of a response:
1. Imposing a normal curve onto the histogram,
2. Probability plot,
3. Numerical methods for testing the degree of departure from normal.
1. Imposing a normal curve onto a histogram for the blood pressure data of 1900 young adults between 15 and 20 years old:
[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]
The normal curve indicates there are a few large blood pressure measurements. In fact, the descriptive statistics show the highest is 210, which is much more than 2 s.d. above the average. It suggests 210 is very rare. One should check immediately whether there is a typo.
How to construct this plot using Minitab:
• Go to Stat, choose Basic Statistics, choose Display Descriptive Statistics.
• Enter the variable. Click on the ‘Graphs’ option.
• In the Graphs option dialog, you can have a variety of choices. One of them is Histogram with Normal Curve.
2. Normal Probability Plot: It is a two-dimensional plot.
The Y-axis is the estimated cumulative probabilities computed by:
\[ \frac{\text{rank} - 3/8}{n + 1/4} \]
The X-axis is the original data in ascending order.
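Diagnosis:
When the data follow a normal curve, the plotted points should follow a straight line.
When data are skewed-to-right, the plot would look like:
[Figure: example normal probability plot for skewed-to-right data.]
When data are skewed-to-left, the plot would look like:
[Figure: example normal probability plot for skewed-to-left data.]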
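The coordinates of a normal probability plot can be sketched as follows, using the plotting position (rank - 3/8)/(n + 1/4). The function name and the sample readings are hypothetical. Ties are simply given consecutive ranks, which may differ from how Minitab treats them. The z-score column is included because, on a probability-scaled axis, normal data fall on a straight line exactly when x is linear in z.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (x, estimated cumulative probability, normal score) for each sorted value."""
    n = len(data)
    points = []
    for rank, x in enumerate(sorted(data), start=1):
        p = (rank - 3 / 8) / (n + 1 / 4)    # estimated cumulative probability
        z = NormalDist().inv_cdf(p)         # corresponding standard normal score
        points.append((x, p, z))
    return points

# Hypothetical systolic readings, for illustration only
sample = [112, 98, 120, 131, 105, 117, 109]
for x, p, z in normal_plot_points(sample):
    print(x, round(p, 3), round(z, 3))
```

Plotting x against z (or against p on a normal probability scale) with any graphing tool gives the plot described here.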
[Figure: Normal Probability Plot for the Blood Pressure Data. X-axis: Systolic Blood Pressure (100 to 200); Y-axis: Probability (.001 to .999).]
Reference probabilities and the corresponding percentiles marked on the plot:
Probability   Percentile
0.10           96.241
0.25          104.929
0.50          114.582
0.75          124.235
0.90          132.922
Average: 114.590
StDev: 14.0595
N: 1909
Anderson-Darling Normality Test
A-Squared: 11.502
P-Value: 0.000
The normal probability plot indicates that the systolic blood pressure does not follow a normal curve. The pattern also shows that the distribution is somewhat skewed-to-the-right.
3. Test statistic for testing whether the blood pressure follows a normal curve.
Graphical methods are good for showing the pattern, and they give us a pretty clear picture that the data do not follow a normal curve.
Numerically, there are methods that will test such a hypothesis. The test statistic is given in the same graph as the Normal Probability Plot.
The Anderson-Darling Normality Test is presented here. The AD-value = 11.5, and the corresponding p-value is .000.
• Note: The p-value measures how consistent the data are with a normal distribution. The smaller the p-value, the stronger the evidence that the response variable does not follow a normal curve. A common cut-off point is 5%. In this case, the p-value is reported as .000, so it is clear that the distribution of systolic blood pressure does not follow a normal curve.
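For readers curious about what lies behind the A-Squared and P-Value figures, here is a hedged Python sketch. `anderson_darling` computes the usual A-Squared statistic with the mean and s.d. estimated from the data. `ad_p_value` uses the small-sample adjustment and approximation formulas commonly attributed to D'Agostino and Stephens. That Minitab uses exactly these formulas is an assumption, and both function names are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(data):
    """A-Squared for normality, with mean and s.d. estimated from the sample."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    F = [fitted.cdf(v) for v in x]          # fitted cumulative probabilities
    total = sum((2 * i + 1) * (math.log(F[i]) + math.log(1 - F[n - 1 - i]))
                for i in range(n))          # i is 0-based, so 2i+1 is 2*rank-1
    return -n - total / n

def ad_p_value(a2, n):
    """Approximate p-value from A-Squared (assumed D'Agostino-Stephens formulas)."""
    a = a2 * (1 + 0.75 / n + 2.25 / n ** 2)     # small-sample adjustment
    if a >= 0.6:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a ** 2)
    if a > 0.34:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a ** 2)
    if a > 0.2:
        return 1 - math.exp(-8.318 + 42.796 * a - 59.938 * a ** 2)
    return 1 - math.exp(-13.436 + 101.14 * a - 223.73 * a ** 2)

# Blood pressure values reported on the plot: A-Squared = 11.502, N = 1909
print(ad_p_value(11.502, 1909))
```

With the reported A-Squared of 11.502 and N = 1909, this approximation gives a p-value far below .0005, consistent with the .000 shown on the plot.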
How to construct a Normal Probability Plot and carry out the Anderson-Darling Normality Test?
1. Go to Stat, choose Basic Statistics, then select Normality Test.
2. In the dialog, enter the variable name.
3. Reference Probabilities allow us to provide a column of cumulative probabilities so that the normal probability plot will show the percentiles for each given cumulative probability.
Note: As we have observed, all three methods give us similar results. Therefore, based on the 1909 cases, the systolic blood pressure of young adults 15 to 20 years old does not follow a normal distribution.
Note: Once we find out the distribution is not normal, it is critical to carry out some further analysis:
– Carefully check the data to see if there are any typos,
– Examine the data using some descriptive measures or other plots to identify extreme cases (details will be discussed in another module).
Hands-on Activity:
Use the above three methods to check the distribution of
Diastolic Blood Pressure data.
Actions to deal with extreme cases
For observational studies (such as surveys):
• The sample sizes are usually large, and it is often impossible to find the possible causes of the extreme data after the data are collected. Therefore, it is critical to collect background and environmental variables that may have a potential impact on the results.
For experimental studies, such as inter-laboratory testing:
• It is important to look for possible causes of the extremes. The study is usually conducted under a controlled experimental environment, so it is more likely that we can find the causes of the extremes, or be able to explain the possible causes.
Deletion of extremes vs. making a transformation to normal
One must be careful about deleting extremes, especially when we are not able to find any causes and the values are reasonable within the context of the study. This may be an indication that the distribution of the response is skewed. For situations such as this, an appropriate approach is to transform the data to be closer to normal.
Method for transforming a variable to normal
When the data show a skewed distribution, statistical methods such as Analysis of Variance may not be valid. One approach is to make a mathematical transformation of the variable so that the transformed variable will be closer to normal.
Some tips for variable transformation:
• If the variable, Y, is skewed-to-right: then ln(Y), log10(Y), or √Y will be closer to normal. (If there are zeros, add .5 to each data value first.)
• If the variable, Y, is skewed-to-left: ln(1/Y), log10(1/Y), 1/√Y, or Y^a, a > 1, will be closer to normal.
An example of Transformation:
The lifetimes of 50 light bulbs are tested by leaving them on all the time until they burn out. The data are recorded in months. Here are the histograms and the normality tests of the raw data, the ln-transformed data, and the square-root-transformed data:
[Figure: Histogram of Life Time. X-axis: Life Time; Y-axis: Frequency.]
[Figure: Histogram of Sqrt(Y). X-axis: Sqrt(Y); Y-axis: Frequency.]
[Figure: Histogram of Ln(Y). X-axis: Ln(Y); Y-axis: Frequency.]
The raw data are skewed-to-right.
The square-root transformation works well.
The ln transformation does not work well.
The normal probability plots and Anderson-Darling tests for the life-time data:
[Figures: Normal Probability Plot for the Life Time Data; Normal Probability Plot for Sqrt(Y); Normal Probability Plot for Ln(Y). Y-axis: Probability (.001 to .999).]
Variable    Average    StDev      N    A-Squared   P-Value
Life Time   9.34544    6.29248    50   0.906       0.019
Sqrt(Y)     2.86016    1.09027    50   0.430       0.297
Ln(Y)       1.93005    0.886131   50   1.071       0.007
As the normal probability plots and the normality test results indicate, Sqrt(Y) is approximately normal. The other two are not.
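The comparison of the raw, square-root, and ln scales can be sketched in Python. This sketch makes two substitutions that are not part of the module. The data are hypothetical (squares of 1 to 20, chosen because their square roots are perfectly symmetric), not the light bulb data. Sample skewness is used as a rough stand-in indicator in place of the Anderson-Darling test. The function names are made up.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Sample skewness: about 0 for symmetric data, > 0 for skewed-to-right."""
    m, s = mean(values), pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def transform_right_skewed(y, kind):
    """Apply sqrt, ln, or log10; if the data contain zeros, add .5 to every value first."""
    shift = 0.5 if any(v == 0 for v in y) else 0.0
    funcs = {"sqrt": math.sqrt, "ln": math.log, "log10": math.log10}
    if kind not in funcs:
        raise ValueError("kind must be 'sqrt', 'ln', or 'log10'")
    return [funcs[kind](v + shift) for v in y]

# Hypothetical skewed-to-right data (NOT the light bulb data from the example)
y = [i * i for i in range(1, 21)]

raw_skew = skewness(y)
sqrt_skew = skewness(transform_right_skewed(y, "sqrt"))
ln_skew = skewness(transform_right_skewed(y, "ln"))
print(round(raw_skew, 3), round(sqrt_skew, 3), round(ln_skew, 3))
```

For this made-up data set the square root removes the skew while ln over-corrects into a left skew. The pattern resembles the light bulb example, but the right choice depends on the data and should be confirmed with a probability plot and normality test.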
Hands-on Activity
Analyze the distribution of variable GR36-Lab-Mean-1
in the TAPPI inter-laboratory testing study, and
determine an appropriate transformation to make the
data closer to a normal distribution.