MATH 2441
Probability and Statistics for Biological Sciences
Percentiles of the Normal Random Variable
The word percentile is used in exactly the same sense as it was defined earlier in the course. Thus, if x is a
random variable with some probability density function, f(x), then the pth percentile of x is the value of x
which divides the total area under the probability density curve into a left-side region of area p/100 and a
right-side region of area (100 - p)/100:
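[Figure: probability density curve of x with the pth percentile of x marked on the horizontal axis; the area to its left is p/100 and the area to its right is (100 - p)/100.]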
The only slightly strange twist is in the notation used for the pth percentile: it consists of a subscript that
gives the value of 1-p/100 (which is the area of the right-side region bounded by the pth percentile). Thus,
$x_{\alpha} \equiv$ the value of x which cuts off a right-hand tail of area $\alpha$ under its probability density curve
$\equiv$ the $100(1 - \alpha)$th percentile of x
Before the course is over, you should see that this subscript notation makes some sense in the situations in
which these symbols are most commonly used. For the moment, just accept the statement that the notation
$x_{\alpha}$ or $z_{\alpha}$ as defined just above will be useful. While this notation seems to focus on right-hand tail areas, the
phrase "pth percentile" still refers strictly to the definition illustrated in the figure above.
Thus, for example, the 80th percentile of a random variable x is the value of x that divides the region under
its probability density curve into a left part of area 0.80 and a right part of area 0.20. However, notationally,
this 80th percentile value would be written as $x_{0.20}$.
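The conversion between "pth percentile" and the tail-area subscript is simple arithmetic. A minimal sketch follows; the function names are illustrative and are not part of the course notation.

```python
def right_tail_area(p):
    """Subscript used for the pth percentile: alpha = (100 - p) / 100."""
    return (100 - p) / 100


def percentile_from_alpha(alpha):
    """Inverse mapping: the value with subscript alpha is the 100(1 - alpha)th percentile."""
    return 100 * (1 - alpha)


alpha = right_tail_area(80)
print(f"the 80th percentile is written with subscript {alpha:.2f}")
print(f"subscript 0.35 denotes percentile number {percentile_from_alpha(0.35):.0f}")
```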
Having defined what is meant by a percentile of a random variable and the notation $x_{\alpha}$, the next step is to
develop methods for computing these values. We have the tools to do so when x is a normally distributed
random variable.
Percentiles of the Standard Normal Random Variable
Finding percentiles of z amounts to the reverse of finding probabilities for z. We start with an area (which is
equivalent to a probability) and work backwards to a value of z. Part of the process may involve relating the
initial area to the area of a region of the form $0 \le z \le b$, to match the areas represented in our version of the
standard normal probability table. We'll illustrate the procedure with a number of examples.
Example 1: Find the 80th percentile of z.
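[Figure: standard normal curve with the 80th percentile of z (that is, $z_{0.20}$) marked to the right of z = 0. The total area to its left must be 0.80: the area to the left of z = 0 is 0.50, so the area between z = 0 and $z = z_{0.20}$ must be 0.80 - 0.50 = 0.30.]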
Solution
We need to find the value of z which cuts off a left-hand tail of
area 0.80 (or equivalently, a right-hand tail of area 0.20).
The situation is sketched in the figure. First, we know that
the 80th percentile of z must be a value to the right of z = 0,
because the area to its left is 0.80 which is greater than 0.5.
David W. Sabo (1999)
Percentiles of Normal Random Variables
This tells us that the region to the left of the 80th percentile must include the entire left-half of the distribution
and then some.
Secondly, as illustrated in the figure, we can easily compute that the area for the region $0 \le z \le z_{0.20}$ is 0.30.
(Here we are using the equivalent notation: the 80th percentile of z $\equiv z_{0.20}$.) Thus, $z_{0.20}$ must be the value of
z corresponding to the entry 0.3000 in our standard normal probability table.
However, when we look at the table, we don't find a probability entry which is exactly 0.3000. The closest
we come is
z = 0.84    table entry = 0.2995
z = 0.85    table entry = 0.3023
Many people would just pick the closest value (in this case 0.2995 is closer to 0.3000 than is 0.3023)
and so would state in answer to this problem: the 80th percentile of z is approximately 0.84 (or equivalently,
$z_{0.20} \approx 0.84$).
Others might employ linear interpolation, which in this case would give
$$z_{0.20} \approx 0.84 + 0.01 \times \frac{0.3000 - 0.2995}{0.3023 - 0.2995} = 0.84 + 0.00179 \approx 0.842$$
to add a single additional decimal place to the value of z readable from the table. This amounts to saying
that the entry 0.3000, if present, would lie about 20% of the way between the entries for z = 0.84 and for
z = 0.85. The value 0.3000 is about 20% of the way between 0.2995 and 0.3023.
In hand calculations in this course, we recommend the first approach: just pick the closest table entry. If
you are using a computer program to generate percentile values, the program will usually give you whatever
precision you wish (or is warranted) automatically, so no further refinement would be necessary.
Interpolation is just a strategy to partially compensate for the limitations on table sizes.
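The three routes discussed in Example 1 (closest table entry, linear interpolation, and a computer program) can be compared side by side. This is a minimal sketch, assuming Python 3.8 or later, whose standard-library `statistics.NormalDist` class is used here in place of the printed table; the variable names are illustrative.

```python
from statistics import NormalDist

# The two table rows quoted in the text: (z, Pr(0 <= Z <= z)).
z_lo, area_lo = 0.84, 0.2995
z_hi, area_hi = 0.85, 0.3023
target = 0.80 - 0.50  # area between z = 0 and the 80th percentile

# Approach 1: take the z whose table entry is closest to the target area.
rows = [(z_lo, area_lo), (z_hi, area_hi)]
nearest = min(rows, key=lambda row: abs(row[1] - target))[0]

# Approach 2: linear interpolation between the two bracketing rows.
interpolated = z_lo + (z_hi - z_lo) * (target - area_lo) / (area_hi - area_lo)

# Approach 3: let software invert the cumulative distribution directly.
exact = NormalDist().inv_cdf(0.80)

print(f"nearest={nearest:.2f}  interpolated={interpolated:.4f}  software={exact:.4f}")
```

All three values agree to two decimal places, which is why the closest-entry approach is usually adequate for hand work.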

Example 2: Find the 26th percentile of z.
[Figure: standard normal curve with c, the 26th percentile of z, marked to the left of z = 0; the area to the left of c is 0.26 and the area between c and 0 is 0.5 - 0.26 = 0.24.]
Solution
To make the display of work easier, represent the value of the
26th percentile of z by the symbol c. Then, if c is the 26th
percentile of z, we know that c must lie to the left of z = 0, as
sketched in the figure. (Percentiles of z less than
the 50th are always to the left of z = 0, and percentiles of z
greater than the 50th are always to the right of z = 0.)
So, c is a negative value, and we know that it must be chosen so that
$\Pr(z \le c) = 0.26$.
This means that the probability of z being between c and 0 is 0.24. However, $c \le z \le 0$ is not a region
represented in our standard normal probability tables because c is negative. Fortunately, by symmetry,
$\Pr(c \le z \le 0) = \Pr(0 \le z \le -c)$
and $\Pr(0 \le z \le -c)$ can be read from our standard normal probability tables because -c is a positive number.
The entry in the table closest to 0.2400 is 0.2389, which corresponds to z = 0.64. Thus
$-c \approx 0.64 \;\Rightarrow\; c \approx -0.64$
So, we conclude that the 26th percentile of z is approximately -0.64.
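The symmetry argument of Example 2 can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist` (Python 3.8+) as a stand-in for the table; it finds the positive value b with $\Pr(0 \le z \le b) = 0.24$, negates it, and compares the result with inverting the left-hand area 0.26 directly.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1
p = 0.26           # left-hand area for the 26th percentile

# Table-style route: find b > 0 with Pr(0 <= Z <= b) = 0.5 - p, then negate.
b = z.inv_cdf(0.5 + (0.5 - p))
c_by_symmetry = -b

# Direct route: invert the cumulative distribution at p.
c_direct = z.inv_cdf(p)

print(f"by symmetry: {c_by_symmetry:.4f}, direct: {c_direct:.4f}")
```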

Those first two examples are the only variations possible on the simple percentile problem. However, there
are similar types of problems which, while not strictly simple percentile calculations, still require working
backwards from a probability to a value (or values) of the original random variable. We illustrate several
problems of this type for the standard normal random variable.
Example 3: Find the value of z which cuts off a right-hand tail of area 0.65.
Solution
[Figure: standard normal curve with c marked to the left of z = 0; the area to the right of c is 0.65.]
Let this unknown value of z be represented by the symbol c.
As seen from the sketch, c must be a negative value. Further,
since
$\Pr(c \le z \le 0) = 0.65 - 0.50 = 0.15$,
we also know that
$\Pr(0 \le z \le -c) = 0.15$
The standard normal probability table entry nearest in value to 0.1500 is 0.1517, corresponding to z = 0.39.
Thus,
$-c \approx 0.39 \;\Rightarrow\; c \approx -0.39$
We conclude that it is the value $z \approx -0.39$ which cuts off a right-hand tail of area 0.65.

Example 4: Find the value of z which cuts off a right-hand tail of area 0.02.
Solution
[Figure: standard normal curve with c marked to the right of z = 0; the right-hand tail beyond c has area 0.02.]
The situation is shown, somewhat exaggerated, in the sketch.
Again, representing the desired value by the
symbol c, we see that we are looking for a positive value,
probably in the vicinity of 2 or larger (the small area of 0.02
would take in the extremes of the visible tail region). Since the
area to the right of z = 0 is 0.50, clearly
$\Pr(0 \le z \le c) = 0.50 - 0.02 = 0.48$
The value closest to 0.4800 in the body of our standard normal probability tables is 0.4798, corresponding to
z = 2.05. Thus, we conclude that the value of z cutting off a right-hand tail of area 0.02 is approximately
2.05.
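Examples 3 and 4 both ask for the value of z with a given area to its right. A minimal sketch of that calculation follows, assuming Python's standard-library `statistics.NormalDist` (Python 3.8+); the helper name `z_for_right_tail` is hypothetical.

```python
from statistics import NormalDist


def z_for_right_tail(alpha):
    """Return the value of z that has area alpha to its right."""
    # A right-hand tail of area alpha leaves area 1 - alpha on the left.
    return NormalDist().inv_cdf(1 - alpha)


z_065 = z_for_right_tail(0.65)   # Example 3: more than half the area, so negative
z_002 = z_for_right_tail(0.02)   # Example 4: small tail, so a value near 2

print(f"right tail 0.65: z = {z_065:.4f}")
print(f"right tail 0.02: z = {z_002:.4f}")
```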

Example 5: Find the value of c which satisfies
$\Pr(-c \le z \le c) = 0.95$
Solution
[Figure: standard normal curve with the central region between -c and c shaded, area = 0.95; each of the two tails has area 0.025.]
We are being asked to find values of z bounding the central
region of area 0.95, as sketched in the figure.
The central shaded region consists of two mirror-image
halves because of the symmetry of the z-distribution, and so
we can say that
$\Pr(0 \le z \le c) = 0.95/2 = 0.475$
The entry in the z-table which is closest to 0.4750 is exactly 0.4750, corresponding to z = 1.96. Thus, we
conclude that the value of c satisfying the original condition is c = 1.96.

Example 6: Find the value of c which satisfies
Pr(|z| > c) = 0.05
Solution
The vertical bars surrounding the symbol z here stand for 'absolute value', meaning 'ignore the sign of z
when comparing to c.' The condition |z| > c really represents two conditions:
z positive: $|z| > c \;\Rightarrow\; z > c$
z negative: $|z| > c \;\Rightarrow\; z < -c$
Thus, |z| > c represents a two-tailed region, as shown in the
sketch. By symmetry, both regions have the
same area, and so we can write
$\Pr(z > c) = 0.05/2 = 0.025$
But, this means that
$\Pr(0 \le z \le c) = 0.5 - 0.025 = 0.4750$.
[Figure: standard normal curve with two shaded tails, one to the left of -c and one to the right of c.]
Thus, as in Example 5, since consulting the z-table indicates the entry 0.4750 corresponds to z = 1.96, we
conclude that the required value for c here is 1.96.
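Examples 5 and 6 state the same condition in two ways: a central area of 0.95 or a combined two-tail area of 0.05. The sketch below, which assumes Python's standard-library `statistics.NormalDist` (Python 3.8+) and uses hypothetical helper names, computes c both ways and then checks the result by going forward from c to an area.

```python
from statistics import NormalDist

z = NormalDist()


def central_bound(central_area):
    """c such that Pr(-c <= Z <= c) = central_area."""
    return z.inv_cdf(0.5 + central_area / 2)


def two_tail_bound(tail_area):
    """c such that Pr(|Z| > c) = tail_area (both tails combined)."""
    return z.inv_cdf(1 - tail_area / 2)


c5 = central_bound(0.95)    # Example 5
c6 = two_tail_bound(0.05)   # Example 6
print(f"Example 5: c = {c5:.3f}; Example 6: c = {c6:.3f}")

# Forward check: the area between -c5 and c5 should come back as 0.95.
central_area = z.cdf(c5) - z.cdf(-c5)
print(f"Pr(-c <= Z <= c) = {central_area:.4f}")
```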

Percentiles of General Normal Distributions
To compute percentiles of any normally distributed random variable, x, first calculate the
corresponding percentiles for the standard normal random variable, z, and then compute the corresponding
value or values of x using the usual formula:
$x = \mu + z\sigma$
where $\mu$ and $\sigma$ are the mean and standard deviation of x, respectively.
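The reason this conversion carries percentiles of z over to percentiles of x can be written out in a few lines. The following is a short derivation in the tail-area notation defined earlier, assuming only that the standard deviation is positive and that z is the standardized form of x:

```latex
\begin{align*}
\Pr(z > z_{\alpha}) &= \alpha, \qquad z = \frac{x - \mu}{\sigma} \\
\Pr\!\left(\frac{x - \mu}{\sigma} > z_{\alpha}\right) &= \alpha \\
\Pr(x > \mu + z_{\alpha}\,\sigma) &= \alpha
\quad\Longrightarrow\quad x_{\alpha} = \mu + z_{\alpha}\,\sigma
\end{align*}
```

Multiplying by a positive standard deviation does not reverse the inequality, so the value of x obtained cuts off the same right-hand tail area as the value of z it came from.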
Example 7: An approximately normally distributed random variable, x, has a mean of 375 g and a standard
deviation of 43 g. Find its 80th and 10th percentiles. Also, determine $x_{0.35}$.
Solution
To determine the 80th percentile of x, we first determine the 80th percentile of z. In fact, we've already done
this above in Example 1, obtaining the result $z \approx 0.84$. Thus, the 80th percentile of x is just
$375 + (0.84)(43) = 375 + 36.12 \approx 411$ g.
Similarly, to compute the 10th percentile of x, we first determine that the 10th percentile of z is -1.28 (from the
z-table entry closest to 0.4000). Thus, the 10th percentile of x is
$375 + (-1.28)(43) = 375 - 55.04 \approx 320$ g.
Finally, $x_{0.35}$ is the value of x that separates the top 35% of all possible x-values from the bottom 65%. The
value of z which does this for the z-distribution is 0.39 (by symmetry with Example 3), and so the value of x that
will do this for the x-distribution is
$375 + (0.39)(43) = 375 + 16.77 \approx 392$ g.
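The three results of Example 7 can be reproduced both ways: from the rounded table values of z, and by letting software invert the distribution of x directly. This sketch assumes Python's standard-library `statistics.NormalDist` (Python 3.8+); the dictionary names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 375.0, 43.0   # grams, from Example 7
x = NormalDist(mu, sigma)

# Hand calculation: table z-values, then x = mu + z * sigma.
table_z = [(80, 0.84), (10, -1.28), (65, 0.39)]
by_table = {p: mu + z * sigma for p, z in table_z}

# Software calculation: invert the cumulative distribution of x directly.
by_software = {p: x.inv_cdf(p / 100) for p in (80, 10, 65)}

for p in (80, 10, 65):
    print(f"{p}th percentile: table {by_table[p]:.2f} g, software {by_software[p]:.2f} g")
```

The small differences come from rounding z to two decimal places in the table; to the nearest gram the two routes agree.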

You will see later in the course that you need to calculate percentiles of the z-distribution and other normal
distributions as part of many methods of statistical inference. However, there are also some more
immediate applications of these calculations. We illustrate with several examples.
Example 8: A study is done of technicians who routinely scan tissue slides for signs of anomalous cell
structures with a view towards eventually determining if certain techniques are more efficient than others.
One early result is that among technicians using one approach, the number of slides correctly scanned per
hour is an approximately normally distributed random variable with a mean of 38.4 and a standard deviation
of 6.3. How many slides would you have to average in one hour to be in the fastest ten percent of such
technicians?
Solution
To be in the fastest 10% of such technicians is the same
thing as saying your rate of scanning slides is at or above the 90th
percentile for the distribution of hourly slide numbers. So,
this question is really asking us to find the 90th percentile of
x = number of slides scanned per hour
where x is an approximately normally distributed random
variable with a mean of 38.4 and a standard deviation of
6.3.
[Figure: normal curve for x with $x_{0.10}$ marked; the slowest 90% lie to its left and the fastest 10% (area = 0.10) to its right.]
To do this, we first need the 90th percentile of z. Consulting the z-table (and using the approach illustrated in
Example 1 above), we find that the probability entry closest in value to 0.4000 is 0.3997, corresponding to
z = 1.28. Thus, the 90th percentile of x is
38.4 + (1.28)(6.3) = 38.4 + 8.064 = 46.464
Thus, to be in the fastest 10% of all slide scanners, you would have to average approximately 46.5 slides
per hour or more.
(Note: even without doing any calculations, you know that the fastest 10% of the technicians must be
scanning substantially more than the average, 38.4, slides per hour. If your answer had come out to be less
than 38.4, you should recognize immediately that a blunder has been made.)
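The arithmetic of Example 8, together with the sanity check described in the note, can be sketched as follows. The code assumes Python's standard-library `statistics.NormalDist` (Python 3.8+); the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 38.4, 6.3   # slides per hour, from Example 8

hand = mu + 1.28 * sigma                          # z-table value for the 90th percentile
software = NormalDist(mu, sigma).inv_cdf(0.90)    # same percentile without rounding z

# Sanity check from the note: the fastest 10% must lie above the mean.
assert hand > mu and software > mu

print(f"hand: {hand:.3f}, software: {software:.3f} slides per hour")
```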

Example 9: A biotechnologist is studying the course of a human viral infection. Although some infected
individuals seem to recover quite quickly, others take a much longer time to recover. Her data suggests that
the number of days between first onset of symptoms and the complete disappearance of viral particles is an
approximately normally distributed random variable with a mean of 23.6 days and a standard deviation of
9.2 days. As a first step in trying to detect characteristics which might promote faster recovery, she decides
to look in more detail at the 20% of infected individuals who recovered most quickly. Determine the
maximum number of days to recovery for individuals in the 20% who were fastest to recover.
Solution:
Fastest recovery means shortest time to recovery. Thus, if x is defined to be the number of days from onset
of symptoms until becoming virus free, we are required to find the 20th percentile of x, that is, the value which
separates possible observations of x into the smallest 20% and the largest 80%. Now, the 20th percentile of
z is -0.84 (the mirror image of the 80th percentile found in Example 1). Since x is an approximately normally
distributed random variable with mean of 23.6 days and standard deviation of 9.2 days, the 20th percentile of x is
$23.6 + (-0.84)(9.2) = 15.872 \approx 16$ days.
So, her detailed study should include those persons who recovered in 16 or fewer days.
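A useful check on a percentile calculation is to go forward again: compute the probability of falling at or below the cutoff and confirm it is close to the intended 20%. A minimal sketch for Example 9, assuming Python's standard-library `statistics.NormalDist` (Python 3.8+), with illustrative variable names:

```python
from statistics import NormalDist

recovery = NormalDist(mu=23.6, sigma=9.2)   # days, from Example 9

cutoff = recovery.inv_cdf(0.20)             # 20th percentile of x
share_within_16 = recovery.cdf(16)          # fraction recovering in 16 days or fewer

print(f"20th percentile: {cutoff:.2f} days")
print(f"fraction recovering within 16 days: {share_within_16:.3f}")
```

Because the cutoff was rounded up to 16 days, the fraction captured is slightly more than 20%.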

Example 10: A food technologist is involved in a project to evaluate a proposed fat substitute. One of the
questions to be answered is the effect of the material on blood pressure levels. To establish a baseline, she
obtains pre-ingestion blood pressure readings for a large number of participants in the study, and concludes
from her data that systolic blood pressure in the population appears to be an approximately normally
distributed random variable with a mean of 115 mm and a standard deviation of 11 mm. Describe the
systolic blood pressures of those individuals who are in the 10% of the population with blood pressures most
different from the mean.
Solution
Your systolic blood pressure can differ from the population mean by either being lower than the mean or by
being higher than the mean. Since the population distribution here is approximately normal, the 10% most
different from the mean will be the 5% lowest and the 5% highest. Thus, defining
x = systolic blood pressure reading for a randomly selected individual
this problem is asking for the same sort of results for x
as Example 6 above determined for z: that is, find the
value of c for which
$\Pr(|x - \mu_x| > c) = 0.10$
[Figure: normal curve for x centered at $\mu_x$, with two shaded tails below $\mu_x - c$ and above $\mu_x + c$; combined area = 0.10.]
Using the principle that we first solve this problem for
z and then transform the z-values back to the
corresponding x-values, we recall that this type of z
problem has already been solved in Example 6.
Following the procedure illustrated in that example,
we get that the z-values bounding two identical tails of
total area 0.10 are $\pm 1.645$ (this is one of the few z-percentiles where people often interpolate, because the
two table entries bracketing 0.4500 are equidistant from this value). From this, we obtain the results that
those individuals with systolic blood pressure readings less than 115 - (1.645)(11) = 96.905 mm, and those
with systolic blood pressure readings greater than 115 + (1.645)(11) = 133.095 mm form the 10% of the
population with readings most different from the mean.
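The two cutoffs of Example 10 follow from a single z-value and the usual conversion back to x. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` (Python 3.8+); the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 115.0, 11.0   # systolic pressure in mm, from Example 10
tail_total = 0.10         # combined area of both tails

# Each tail has half the combined area, so invert at 1 - 0.10/2 = 0.95.
z_c = NormalDist().inv_cdf(1 - tail_total / 2)
c = z_c * sigma           # distance from the mean, in the units of x
low, high = mu - c, mu + c

print(f"z = {z_c:.3f}; readings below {low:.1f} mm or above {high:.1f} mm")
```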

If you are formulating statistical calculations in a Microsoft Excel spreadsheet, you should use the function
NORMSINV() to compute percentiles for the standard normal distribution, and NORMINV() to compute
percentiles for general normal distributions. Details on the usage of these functions are available via the
paste function tool.