Continuous Random Variables

8 - Continuous Random Variables and the Normal Distribution
Motivating Problem: Diagnosing Spina Bifida
The procedure of amniocentesis involves drawing a sample of the amniotic fluid that surrounds an unborn
child in its mother’s womb. If the concentration of alpha fetoprotein is high, this can indicate that the child
has the condition spina bifida, which can be very serious. However, the concentration of alpha fetoprotein
tends to increase with the size of the foetus. Amniocentesis is not without risks because it results in
miscarriage for 1% of those who have it, so preliminary tests involve measuring the level of alpha
fetoprotein in the mother’s urine. For mothers with normal foetuses the mean level of alpha fetoprotein is
15.73 μmoles/liter with a standard deviation of 0.72 μmoles/liter. For mothers carrying foetuses with spina
bifida the mean is 23.05 μmoles/liter and the standard deviation is 4.08 μmoles/liter. In both groups the
distribution of alpha fetoprotein appears to be approximately Normal. To operate a diagnostic test for spina bifida,
medical professionals must set a threshold concentration of alpha fetoprotein, T, say. If the alpha fetoprotein
level is below T, the foetus is diagnosed as not having spina bifida, whereas if the level is above T further
testing is required. If T were set at 17.80 μmoles/liter:
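[Figure: Normal curves for the two groups, centered at 15.73 (normal foetuses) and 23.05 (foetuses with spina bifida).]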
What is the probability that a foetus with spina bifida is correctly diagnosed?
What is the probability that a foetus not suffering from spina bifida is correctly diagnosed?
If they wanted to ensure that 99% of foetuses with spina bifida were correctly diagnosed, at what
level should they set T? What are the implications of setting T as this level?
Our goal is to be able to answer these questions assuming that the alpha fetoprotein levels for both groups of
foetuses are approximately Normally distributed, i.e. follow bell-shaped curves as shown in the diagram
above.
CONTINUOUS RANDOM VARIABLES
If a random variable, X, can take any value in some interval of the real line it is called a
__________________________
Examples:
THE STANDARDIZED HISTOGRAM or DENSITY SCALING
Example: Dietary Carbohydrate: The average daily intake of carbohydrate (g/day) in the diet found for a
sample n=5929 people.
The histogram of the data given in Graph (a) shows the carbohydrate intake. From this we see that:
*
*
*
[Figure: four panels; horizontal axis Carbohydrate (g/day), 0 to 800; vertical axis on the density scale.
(a) Standardized histogram.
(b) Area between a = 225 and b = 375 shaded. Shaded area = .483 (corresponds to 48.3% of observations).
(c) With approximating curve.
(d) Area between a = 225 and b = 375 shaded. Shaded area = .486 (cf. area = .483 for histogram).]
Note: Histograms are usually drawn with the height of the rectangle for the ith interval being the frequency
(representing the count) or the relative frequency (representing the proportion).
The standardized histogram adjusts the height of the rectangle to ____________________
__________________________________________________.
i.e. The area of the ith rectangle tells us what proportion of the data lie in the ith class interval.
For a standardized histogram:
The vertical scale is: relative frequency / interval width. This is called the density scale.
Total area under the histogram = ______
The proportion of the data between a and b is __________________________________________
Carbohydrate Example (cont’d)
The proportion of people with carbohydrate intakes between 225 and 375 g/day is the shaded area in the
histogram in Graph (b) (= 0.483 or 48.3%)
Graph (c) shows an approximating smooth curve on the standardized histogram and on Graph (d) the area
from 225 to 375 is shaded. This area is calculated to be 0.486 and is very close to the proportion of people
who had carbohydrate intakes of between 225 and 375 g/day.
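The density scaling can be checked numerically. The following is a minimal Python sketch, assuming simulated intakes drawn from a Normal distribution with a made-up mean of 300 and standard deviation of 100 (these are not the survey data) and class intervals of width 25; the function name `density_histogram` is hypothetical.

```python
import random

def density_histogram(data, edges):
    """Density-scale height of each bar: relative frequency / interval width."""
    n = len(data)
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if lo <= x < hi)
        heights.append((count / n) / (hi - lo))
    return heights

# Simulated intakes (an assumption for illustration, not the survey data).
random.seed(1)
data = [random.gauss(300, 100) for _ in range(5929)]
data = [x for x in data if 0 <= x < 800]   # keep values inside the plotted range

edges = list(range(0, 825, 25))            # class intervals of width 25 from 0 to 800
heights = density_histogram(data, edges)

# Area of each bar = height * width; the bar areas sum to 1.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]
total_area = sum(areas)

# Proportion between a = 225 and b = 375 = sum of the bar areas over that range.
a, b = 225, 375
area_ab = sum(ar for ar, lo, hi in zip(areas, edges[:-1], edges[1:])
              if lo >= a and hi <= b)
proportion_ab = sum(1 for x in data if a <= x < b) / len(data)

print(round(total_area, 6), round(area_ab, 4), round(proportion_ab, 4))
```

Because a and b fall on class boundaries here, the shaded area and the proportion counted directly from the data agree.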
Example 2: Cell Radii of Malignant Breast Tumors (see Assignment #1)
In JMP we can add a density scale axis to a histogram by selecting Histogram Options > Density Axis.
The histogram on the left is for cell radii of malignant tumor cells
from the fine needle aspirations in the breast cancer study on
your first assignment. Here the class intervals all have width 1 so
the area of any bar is simply the density axis value, hence the
heights of the bars here represent empirical probabilities. For
example if we let
X = radius of a randomly selected malignant tumor cell
we estimate that P(14 < X < 15) ≈ .10, or a 10% chance.
Example 3: Spina Bifida (SpinaBifida.JMP in Read Only folder)
To obtain a standardized histogram in JMP select Histogram Options > Density Axis. This is a histogram of
the alpha fetoprotein levels of women who are carrying a foetus with spina bifida.
The shaded area in the histogram above is 2.5 (width of the interval) times .10 (height of the bar in the
standardized scale), which is .25. This says that the estimated probability that the alpha fetoprotein level
found in the urine of a mother carrying a foetus with spina bifida lies between 22.5 μmoles/liter and 25
μmoles/liter is .25, or a 25% chance.
If we define X = alpha fetoprotein level found in the urine of mothers carrying a foetus with spina bifida
we can say the following:
P(22.5 ≤ X ≤ 25) ≈ .25, or a 25% chance
Note: Examination of the spreadsheet confirms that the number of observations highlighted is exactly 25 of the 100 observations.
DENSITY/SMOOTH CURVES
Take a standardized histogram, decrease the width of the class intervals and increase the number of
observations. Then the top of the histogram tends to a smooth curve.
[Figure: standardized histograms for n = 100, n = 500, n = 10,000, n = 100,000 and n = 1,000,000 observations; vertical axis: Density.]
The limiting smooth curve can be described using a function called the probability density function.
To obtain a sample-based estimate of this function in JMP select Analyze > Distribution > Fit
Distribution > Smooth Curve. As n increases, as shown above, the histogram itself “converges” to the
probability density function.
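The convergence can be illustrated by simulation. The sketch below assumes data drawn from a Normal model with mean 23.05 and standard deviation 4.08 (the spina bifida figures quoted in the motivating problem). It follows one bar, centered at 23, as n grows and the interval width shrinks. The sample sizes, widths, seed and the helper name `bar_height` are all choices made for illustration.

```python
from statistics import NormalDist

# Normal model for the spina bifida group (mean and SD taken from the notes).
model = NormalDist(mu=23.05, sigma=4.08)

def bar_height(sample, lo, hi):
    """Density-scale height of the bar over [lo, hi): relative frequency / width."""
    count = sum(1 for x in sample if lo <= x < hi)
    return count / len(sample) / (hi - lo)

center = 23.0                  # follow the bar centered at 23
target = model.pdf(center)     # height of the p.d.f. curve at that point

results = []
for n, width in [(100, 2.5), (10_000, 0.5), (1_000_000, 0.25)]:
    sample = model.samples(n, seed=2)   # simulated data, fixed seed
    height = bar_height(sample, center - width / 2, center + width / 2)
    results.append((n, width, height))
    print(f"n = {n:>9,}  width = {width:<4}  bar height = {height:.4f}  p.d.f. = {target:.4f}")
```

With the largest sample the bar height should sit very close to the height of the p.d.f. curve, which is the sense in which the histogram “converges”.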
PROPERTIES OF THE PROBABILITY DENSITY FUNCTION (p.d.f.), f(x)
1. ______ (i.e. the p.d.f. curve stays above the x-axis)
2. P(a ≤ X ≤ b) = ______
3. Area under the p.d.f. curve = ______
ENDPOINTS OF INTERVALS
For a continuous random variable, X, endpoints of intervals are ___________________
Pa  X  b =
(Inclusion or exclusion of the endpoints will not change the area.)
THE NORMAL DISTRIBUTION
Examples: Alpha fetoprotein levels of mothers carrying a foetus with spina bifida.
A limiting distribution that is a smooth, bell-shaped, symmetric curve is
called the Normal p.d.f. curve, or just the Normal curve.
[Figure: Normal curve, symmetric about the mean, with 50% of the area on each side.]
If a random variable, X, has a Normal distribution with a mean μ and a standard deviation σ we write:
The Normal distribution is important because:
* it fits a lot of data reasonably well;
* it can be used to approximate other distributions (e.g. binomial);
* it is important in statistical inference (see later work).
The shape is solely determined by μ and σ: the population mean μ controls where the Normal curve is centered, and
the population standard deviation σ controls the spread about μ.
Example: Alpha fetoprotein levels found in the urine of mothers carrying a foetus with spina bifida.
Let X = alpha fetoprotein level in the urine of a mother carrying a foetus with spina bifida.
The mean AFP level is _________ μmoles/liter and the standard deviation is __________ μmoles/liter.
EMPIRICAL RULE
Approximately _______ % of the mothers in this population will have AFP levels within 1 standard
deviation of the mean, i.e. we estimate that approximately ________% of this population of mothers will
have AFP levels:
between _______________ and ________________
= between _______________ and ________________.
Approximately _______ % of the mothers in this population will have AFP levels within 2 standard
deviations of the mean, i.e. we estimate that approximately ________% of this population of mothers will
have AFP levels:
between _______________ and ________________
= between _______________ and ________________.
Approximately _______ % of the mothers in this population will have AFP levels within 3 standard
deviations of the mean, i.e. we estimate that approximately ________% of this population of mothers will
have AFP levels:
between _______________ and ________________
= between _______________ and ________________.
For the Normal Distribution:
A random observation has approximately:
68% chance of falling within 1σ of μ;
95% chance of falling within 2σ of μ;
99.7% chance of falling within 3σ of μ.
or
In a Normal distribution, approximately:
68% of observations are within 1σ of μ;
95% of observations are within 2σ of μ;
99.7% of observations are within 3σ of μ.
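The percentages in the rule are rounded. As a check, the areas can be computed from the standard Normal curve, since the probability of falling within k standard deviations of the mean is the same for every Normal distribution. This is a minimal Python sketch using the standard library's `statistics.NormalDist`; the function name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard Normal: mean 0, standard deviation 1

def within_k_sd(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any Normal distribution."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```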
OBTAINING OTHER PROBABILITIES ASSOCIATED WITH A NORMAL DISTRIBUTION
Normal distribution probabilities can be obtained from all statistical packages by giving the mean and
standard deviation of the distribution (see Normal Probability Calculator in Tutorials section of website).
Most tables and computer packages calculate the value of P(X ≤ x), i.e. cumulative or lower-tail probabilities.
[Figure: Normal curve with the area to the left of x shaded. Area = P(X ≤ x) or, equivalently, P(X < x).]
Standardization and the Standard Normal Distribution
Fact:
If X ~ N(μ, σ) and we define a new random variable Z = (X - μ)/σ, then Z ~ N(0, 1),
i.e. we create a new random variable Z where the observed values of Z are the z-scores for the random
variable X. The process of converting a random variable X to z-scores is called standardization.
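Standardization can be checked in software: the lower-tail probability computed on the original scale matches the one computed from the z-score on the standard Normal scale. This is a minimal Python sketch using the spina bifida parameters from the notes; the values x = 20 and 25 are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 23.05, 4.08        # spina bifida group (from the notes)
X = NormalDist(mu, sigma)      # distribution on the original scale
Z = NormalDist(0, 1)           # standard Normal distribution

x = 20.0                       # hypothetical value of interest
z = (x - mu) / sigma           # its z-score

lower_from_x = X.cdf(x)        # P(X <= x), computed directly
lower_from_z = Z.cdf(z)        # P(Z <= z), computed after standardizing
upper = 1 - lower_from_x       # P(X > x): upper tail = 1 - lower tail

# An interval probability is a difference of two lower-tail areas.
a, b = 20.0, 25.0              # hypothetical interval
between = X.cdf(b) - X.cdf(a)  # P(a < X < b)

print(round(z, 3), round(lower_from_x, 4), round(lower_from_z, 4))
```

Tables give only `Z.cdf`-style lower-tail areas, which is why the z-score step is needed when working by hand.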
Basic method for obtaining probabilities
1. Sketch a Normal curve, marking the mean and the value(s) of interest.
2. Shade the area under the curve corresponding to the required probability.
3. Convert all values in original scale to their corresponding z-scores.
4. Obtain the desired probability from the lower-tail areas provided by the standard normal table found
in the front inside cover of the text.
Find the following standard normal probabilities using the Standard Normal Table
a) P(Z < .67)
b) P(Z > 2.25)
c) P(Z > 3.00)
e) P(Z < -2.33)
f) P(-1.96 < Z < 1.96)
h) Find z so that P(Z < z) = .90, i.e. what is the 90th percentile of the standard normal distribution?
Spina Bifida Example (continued)
X = AFP level of a randomly selected mother carrying a foetus with spina bifida. Let's assume that
X ~ Normal(μ = 23.05, σ = 4.08), using the sample mean and sample standard deviation.
Find the following:
a) P(X < 15.00)
=
b) P(X < 27.00)
c) P(X > 17.0)
d) Find the 90th percentile.
e) Find the 25th percentile
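Percentiles reverse the standardization step: find the z-score with the required lower-tail area, then convert back with x = μ + zσ. This is a minimal Python sketch for a percentile not asked for above (the 75th, chosen only as an illustration), using `inv_cdf` from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

mu, sigma = 23.05, 4.08             # spina bifida group (from the notes)
p = 0.75                            # hypothetical: the 75th percentile

z_p = NormalDist().inv_cdf(p)       # z with P(Z < z) = p
x_p = mu + z_p * sigma              # convert the z-score back to the AFP scale

# The same percentile obtained directly on the original scale.
direct = NormalDist(mu, sigma).inv_cdf(p)

print(round(z_p, 4), round(x_p, 2), round(direct, 2))
```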
Original Problem: Spina bifida
[Figure: Normal curves for the two groups, centered at 15.73 and 23.05.]
Recall: For normal foetuses μ = 15.73, σ = 0.72 and for foetuses with spina bifida μ = 23.05 and σ = 4.08.
Assume the threshold for detecting spina bifida is set at 17.8. (A foetus would be diagnosed as not having
spina bifida if the fetoprotein level is below 17.8.)
a) What is the probability that a foetus not suffering from spina bifida is correctly diagnosed?
b) What is the probability that a foetus with spina bifida is correctly diagnosed?
c) If they wanted to ensure that 99% of foetuses with spina bifida were correctly diagnosed, at what level
should they set T?
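The three questions can also be checked in software. This is a minimal Python sketch under the stated Normal assumptions. It takes "correctly diagnosed" to mean a level below T for a normal foetus and a level above T (flagged for further testing) for a foetus with spina bifida. The variable names are made up for illustration.

```python
from statistics import NormalDist

normal_group = NormalDist(15.73, 0.72)   # mothers with normal foetuses
sb_group = NormalDist(23.05, 4.08)       # mothers carrying foetuses with spina bifida
T = 17.8                                 # threshold from the problem

# a) Normal foetus correctly diagnosed: AFP level below T.
p_normal_correct = normal_group.cdf(T)

# b) Spina bifida foetus correctly flagged: AFP level above T.
p_sb_correct = 1 - sb_group.cdf(T)

# c) T such that 99% of the spina bifida group lies above it,
#    i.e. the 1st percentile of that group.
T_99 = sb_group.inv_cdf(0.01)

# Consequence of using T_99: proportion of normal foetuses still falling below it.
p_normal_correct_at_T99 = normal_group.cdf(T_99)

print(round(p_normal_correct, 4), round(p_sb_correct, 4),
      round(T_99, 2), round(p_normal_correct_at_T99, 4))
```

Under these assumptions the lower threshold in c) falls well below the mean for normal foetuses, so nearly all mothers with normal foetuses would be sent for further testing.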