MATH 2441 Probability and Statistics for Biological Sciences

Percentiles of the Normal Random Variable

The word percentile is used in exactly the same sense as it was defined earlier in the course. Thus, if x is a random variable with some probability density function, $f(x)$, then the pth percentile of x is the value of x which divides the total area under the probability density curve into a left-side region of area $p/100$ and a right-side region of area $(100 - p)/100$.

[Figure: probability density curve of x; the area to the left of the pth percentile of x is $p/100$ and the area to its right is $(100 - p)/100$.]

The only slightly strange twist is in the notation used for the pth percentile: it consists of a subscript that gives the value of $1 - p/100$ (which is the area of the right-side region bounded by the pth percentile). Thus,

$x_\alpha$ = the value of x which cuts off a right-hand tail of area $\alpha$ under its probability density curve
= the $100(1 - \alpha)$th percentile of x

Before the course is over, you should see that this subscript notation makes some sense in the situations in which these symbols are most commonly used. For the moment, just accept the statement that the notation $x_\alpha$ or $z_\alpha$ as defined just above will be useful.

While this notation seems to focus on right-hand tail areas, the phrase "pth percentile" still refers strictly to the definition illustrated in the figure above. Thus, for example, the 80th percentile of a random variable x is the value of x that divides the region under its probability density curve into a left part of area 0.80 and a right part of area 0.20. However, notationally, this 80th percentile value would be written as $x_{0.20}$.

Having defined what is meant by a percentile of a random variable and the notation $x_\alpha$, the next step is to develop methods for computing these values. We have the tools to do so when x is a normally distributed random variable.

Percentiles of the Standard Normal Random Variable

Finding percentiles of z amounts to the reverse of finding probabilities for z.
We start with an area (which is equivalent to a probability) and work backwards to a value of z. Part of the process may involve relating the initial area to the area of a region of the form $0 \le z \le b$, to match the areas represented in our version of the standard normal probability table. We'll illustrate the procedure with a number of examples.

Example 1: Find the 80th percentile of z.

Solution

We need to find the value of z which cuts off a left-hand tail of area 0.80 (or equivalently, a right-hand tail of area 0.20). The situation is sketched in the figure.

[Figure: standard normal curve with the 80th percentile of z, or $z_{0.20}$, marked to the right of $z = 0$. The total area to its left must be 0.80: the area to the left of $z = 0$ is 0.50, so the area between $z = 0$ and $z = z_{0.20}$ must be $0.80 - 0.50 = 0.30$.]

First, we know that the 80th percentile of z must be a value to the right of $z = 0$, because the area to its left is 0.80, which is greater than 0.5. This tells us that the region to the left of the 80th percentile must include the entire left half of the distribution and then some. Secondly, as illustrated in the figure, we can easily compute that the area of the region $0 \le z \le z_{0.20}$ is 0.30. (Here we are using the equivalent notation: the 80th percentile of z $= z_{0.20}$.) Thus, $z_{0.20}$ must be the value of z corresponding to the entry 0.3000 in our standard normal probability table. However, when we look at the table, we don't find a probability entry which is exactly 0.3000. The closest we come is

$z = 0.84$: table entry = 0.2995
$z = 0.85$: table entry = 0.3023

Many people would just pick the closest value (in this case, 0.2995 is closer to 0.3000 than is 0.3023) and so would state in answer to this problem: the 80th percentile of z is approximately 0.84 (or equivalently, $z_{0.20} \approx 0.84$). Others might employ linear interpolation, which in this case would give

$$z_{0.20} \approx 0.84 + 0.01 \times \frac{0.3000 - 0.2995}{0.3023 - 0.2995} \approx 0.84 + 0.00179 \approx 0.842$$

to add a single additional decimal place to the value of z readable from the table.
This amounts to saying that the entry 0.3000, if present, would lie roughly 18% (about one-fifth) of the way between the entries for $z = 0.84$ and for $z = 0.85$, because the value 0.3000 lies that fraction of the way between 0.2995 and 0.3023. In hand calculations in this course, we recommend the first approach: just pick the closest table entry. If you are using a computer program to generate percentile values, the program will usually give you whatever precision you wish (or is warranted) automatically, so no further refinement would be necessary. Interpolation is just a strategy to partially compensate for the limitations on table sizes.

Example 2: Find the 26th percentile of z.

Solution

To make the display of work easier, represent the value of the 26th percentile of z by the symbol c. Then we know that c must lie to the left of $z = 0$, as sketched in the figure. (Percentiles of z less than the 50th are always to the left of $z = 0$, and percentiles of z greater than the 50th are always to the right of $z = 0$.)

[Figure: standard normal curve with c, the 26th percentile of z, marked to the left of $z = 0$. The area to the left of c is 0.26, and the area between c and 0 is $0.5 - 0.26 = 0.24$.]

So, c is a negative value, and we know that it must be chosen so that $\Pr(z \le c) = 0.26$. This means that the probability of z being between c and 0 is 0.24. However, $c \le z \le 0$ is not a region represented in our standard normal probability tables, because c is negative. By symmetry, though, $\Pr(c \le z \le 0) = \Pr(0 \le z \le -c)$, and $\Pr(0 \le z \le -c)$ can be read from our standard normal probability tables because $-c$ is a positive number. The entry in the table closest to 0.2400 is 0.2389, which corresponds to $z = 0.64$. Thus

$$-c \approx 0.64 \quad\Rightarrow\quad c \approx -0.64$$

So, we conclude that the 26th percentile of z is approximately -0.64.

Those first two examples are the only variations possible on the simple percentile problem.
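As a numerical check on Examples 1 and 2, the following minimal Python sketch computes the same two percentiles directly instead of reading them from a table. It is not part of the original notes: it assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`, and the variable names are illustrative only. The last calculation simply repeats the linear interpolation from Example 1.

```python
from statistics import NormalDist

# Standard normal random variable z: mean 0, standard deviation 1.
z = NormalDist()

# Example 1: 80th percentile of z (area 0.80 to the left), i.e. z_0.20.
z_80 = z.inv_cdf(0.80)

# Example 2: 26th percentile of z (area 0.26 to the left).
z_26 = z.inv_cdf(0.26)

# Linear interpolation between the two table entries quoted in Example 1.
z_interp = 0.84 + 0.01 * (0.3000 - 0.2995) / (0.3023 - 0.2995)

print(round(z_80, 4), round(z_26, 4), round(z_interp, 3))
# prints: 0.8416 -0.6433 0.842
```

The computed values round to the table-based answers 0.84 and -0.64, and the interpolated value 0.842 agrees with the computed 80th percentile to three decimal places.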
There are, however, similar types of problems which, while not strictly simple percentile calculations, still require working backwards from a probability to a value (or values) of the original random variable. We illustrate several problems of this type for the standard normal random variable.

Example 3: Find the value of z which cuts off a right-hand tail of area 0.65.

Solution

Let this unknown value of z be represented by the symbol c. As seen from the sketch, c must be a negative value.

[Figure: standard normal curve with c marked to the left of $z = 0$; the area to the right of c is 0.65.]

Further, since $\Pr(c \le z \le 0) = 0.65 - 0.50 = 0.15$, we also know that

$$\Pr(0 \le z \le -c) = 0.15$$

The standard normal probability table entry nearest in value to 0.1500 is 0.1517, corresponding to $z = 0.39$. Thus,

$$-c \approx 0.39 \quad\Rightarrow\quad c \approx -0.39$$

We conclude that it is the value $z \approx -0.39$ which cuts off a right-hand tail of area 0.65.

Example 4: Find the value of z which cuts off a right-hand tail of area 0.02.

Solution

The situation is shown, somewhat exaggerated, in the sketch. Again, representing the desired value by the symbol c, we see that we are looking for a positive value, probably in the vicinity of 2 or larger (the small area of 0.02 would take in the extremes of the visible tail region).

[Figure: standard normal curve with c marked far to the right of $z = 0$; the area to the right of c is 0.02.]

Clearly, then,

$$\Pr(0 \le z \le c) = 0.50 - 0.02 = 0.48$$

The value closest to 0.4800 in the body of our standard normal probability tables is 0.4798, corresponding to $z = 2.05$. Thus, we conclude that the value of z cutting off a right-hand tail of area 0.02 is approximately 2.05.

Example 5: Find the value of c which satisfies

$$\Pr(-c \le z \le c) = 0.95$$

Solution

We are being asked to find values of z bounding the central region of area 0.95, as sketched in the figure.

[Figure: standard normal curve with the central region between $-c$ and $c$ shaded, area 0.95; each tail has area 0.025.]

The central shaded region consists of two mirror-image halves because of the symmetry of the z-distribution, and so we can say that

$$\Pr(0 \le z \le c) = 0.95/2 = 0.475$$

The entry in the z-table which is closest to 0.4750 is exactly 0.4750, corresponding to $z = 1.96$. Thus, we conclude that the value of c satisfying the original condition is $c = 1.96$.

Example 6: Find the value of c which satisfies

$$\Pr(|z| > c) = 0.05$$

Solution

The vertical bars surrounding the symbol z here stand for 'absolute value', meaning 'ignore the sign of z when comparing to c.' The condition $|z| > c$ really represents two conditions:

z positive: $|z| > c$ means $z > c$
z negative: $|z| > c$ means $z < -c$

Thus, $|z| > c$ represents a two-tailed region, as shown in the sketch.

[Figure: standard normal curve with two shaded tails, one to the left of $-c$ and one to the right of $c$.]

By symmetry, both regions have the same area, and so we can write

$$\Pr(z > c) = 0.05/2 = 0.025$$

But this means that $\Pr(0 \le z \le c) = 0.5 - 0.025 = 0.4750$. Thus, as in Example 5, since consulting the z-table indicates the entry 0.4750 corresponds to $z = 1.96$, we conclude that the required value for c here is 1.96.

Percentiles of General Normal Distributions

To compute percentiles of any normally distributed random variable, x, you first calculate the corresponding percentiles for the standard normal random variable, z, and then compute the corresponding value or values of x using the usual formula:

$$x = \mu + z\sigma$$

where $\mu$ and $\sigma$ are the mean and standard deviation of x, respectively.

Example 7: An approximately normally distributed random variable, x, has a mean of 375 g and a standard deviation of 43 g. Find its 80th and 10th percentiles. Also, determine $x_{0.35}$.

Solution

To determine the 80th percentile of x, we first determine the 80th percentile of z. In fact, we've already done this above in Example 1, obtaining the result $z = 0.84$. Thus, the 80th percentile of x is just

$$375 + (0.84)(43) = 375 + 36.12 \approx 411 \text{ g}$$

Similarly, to compute the 10th percentile of x, we first determine that the 10th percentile of z is -1.28 (from the z-table entry closest to 0.4000, which is 0.3997 at $z = 1.28$).
Thus, the 10th percentile of x is

$$375 + (-1.28)(43) = 375 - 55.04 \approx 320 \text{ g}$$

Finally, $x_{0.35}$ is the value of x that separates the top 35% of all possible x-values from the bottom 65%. The value of z which does this for the z-distribution is 0.39 (by symmetry with Example 3, where a right-hand tail of area 0.65 gave $z \approx -0.39$), and so the value of x that will do this for the x-distribution is

$$375 + (0.39)(43) = 375 + 16.77 \approx 392 \text{ g}$$

You will see later in the course that you need to calculate percentiles of the z-distribution and other normal distributions as part of many methods of statistical inference. However, there are also some more immediate applications of these calculations. We illustrate with several examples.

Example 8: A study is done of technicians who routinely scan tissue slides for signs of anomalous cell structures, with a view towards eventually determining if certain techniques are more efficient than others. One early result is that among technicians using one approach, the number of slides correctly scanned per hour is an approximately normally distributed random variable with a mean of 38.4 and a standard deviation of 6.3. How many slides would you have to average in one hour to be in the fastest ten percent of such technicians?

Solution

To be in the fastest 10% of such technicians is the same thing as saying your rate of scanning slides is at or above the 90th percentile of the distribution of hourly slide numbers. So, this question is really asking us to find the 90th percentile of

x = number of slides scanned per hour

where x is an approximately normally distributed random variable with a mean of 38.4 and a standard deviation of 6.3.

[Figure: normal curve for x with $x_{0.10}$ marked; the slowest 90% lie to its left and the fastest 10% (area = 0.10) lie to its right.]

To do this, we first need the 90th percentile of z. Consulting the z-table (and using the approach illustrated in Example 1 above), we find that the probability entry closest in value to 0.4000 is 0.3997, corresponding to $z = 1.28$.
Thus, the 90th percentile of x is

$$38.4 + (1.28)(6.3) = 38.4 + 8.064 = 46.464$$

Thus, to be in the fastest 10% of all slide scanners, you would have to average approximately 46.5 slides per hour or more. (Note: even without doing any calculations, you know that the fastest 10% of the technicians must be scanning substantially more than the average, 38.4, slides per hour. If your answer had come out to be less than 38.4, you should recognize immediately that a blunder has been made.)

Example 9: A biotechnologist is studying the course of a human viral infection. Although some infected individuals seem to recover quite quickly, others take a much longer time to recover. Her data suggest that the number of days between first onset of symptoms and the complete disappearance of viral particles is an approximately normally distributed random variable with a mean of 23.6 days and a standard deviation of 9.2 days. As a first step in trying to detect characteristics which might promote faster recovery, she decides to look in more detail at the 20% of infected individuals who recovered most quickly. Determine the maximum number of days to recovery for individuals in the 20% who were fastest to recover.

Solution

Fastest recovery means shortest time to recovery. Thus, if x is defined to be the number of days from onset of symptoms until becoming virus free, we are required to find the 20th percentile of x: the value which separates possible observations of x into the smallest 20% and the largest 80%. Now, the 20th percentile of z is -0.84 (the mirror image of the 80th percentile found in Example 1). Since x is an approximately normally distributed random variable with a mean of 23.6 days and a standard deviation of 9.2 days, the 20th percentile of x is

$$23.6 + (-0.84)(9.2) = 15.872 \approx 16 \text{ days}$$

So, her detailed study should include those persons who recovered in 16 or fewer days.

Example 10: A food technologist is involved in a project to evaluate a proposed fat substitute.
One of the questions to be answered is the effect of the material on blood pressure levels. To establish a baseline, she obtains pre-ingestion blood pressure readings for a large number of participants in the study, and concludes from her data that systolic blood pressure in the population appears to be an approximately normally distributed random variable with a mean of 115 mm and a standard deviation of 11 mm. Describe the systolic blood pressures of those individuals who are in the 10% of the population with blood pressures most different from the mean.

Solution

Your systolic blood pressure can differ from the population mean by either being lower than the mean or by being higher than the mean. Since the population distribution here is approximately normal, the 10% most different from the mean will be the 5% lowest and the 5% highest. Thus, defining

x = systolic blood pressure reading for a randomly selected individual

this problem is asking for the same sort of results for x as Example 6 above determined for z: that is, find the value of c for which

$$\Pr(|x - \mu_x| > c) = 0.10$$

[Figure: normal curve for x centred at $\mu_x$, with two shaded tails of combined area 0.10, one below $\mu_x - c$ and one above $\mu_x + c$.]

Using the principle that we first solve this problem for z and then transform the z-values back to the corresponding x-values, we recall that this type of problem has already been solved in Example 6. Following the procedure illustrated in that example, each tail has area $0.10/2 = 0.05$, so we need the z-table entry 0.4500, and we get that the z-values bounding two identical tails of total area 0.10 are $\pm 1.645$ (this is one of the few z-percentiles where people often interpolate, because the two table entries bracketing 0.4500 are equidistant from this value). From this, we obtain the result that those individuals with systolic blood pressure readings less than

$$115 - (1.645)(11) = 96.905 \text{ mm}$$

and those with systolic blood pressure readings greater than

$$115 + (1.645)(11) = 133.095 \text{ mm}$$

form the 10% of the population with readings most different from the mean.
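The calculations for general normal distributions in Examples 7 to 10 can also be sketched in Python. This is not part of the original notes. It assumes Python 3.8 or later (for `statistics.NormalDist`), and the helper names `percentile` and `upper_tail_value` are hypothetical, chosen here to mirror the "pth percentile" and $x_\alpha$ notations. Because the program does not round z to two decimal places, its results can differ slightly from the table-based answers.

```python
from statistics import NormalDist


def percentile(p, mean=0.0, sd=1.0):
    """Return the pth percentile (0 < p < 100) of a normal random variable."""
    return NormalDist(mean, sd).inv_cdf(p / 100)


def upper_tail_value(alpha, mean=0.0, sd=1.0):
    """Return x_alpha, the value cutting off a right-hand tail of area alpha."""
    return percentile(100 * (1 - alpha), mean, sd)


# Example 7: mean 375 g, standard deviation 43 g.
ex7_p80 = percentile(80, 375, 43)
ex7_p10 = percentile(10, 375, 43)
ex7_x035 = upper_tail_value(0.35, 375, 43)

# Example 8: x_0.10 for the number of slides scanned per hour.
ex8 = upper_tail_value(0.10, 38.4, 6.3)

# Example 9: 20th percentile of the number of days to recovery.
ex9 = percentile(20, 23.6, 9.2)

# Example 10: limits cutting off the lowest 5% and the highest 5% of readings.
ex10_low = percentile(5, 115, 11)
ex10_high = percentile(95, 115, 11)

for label, value in [
    ("Ex 7, 80th percentile (g)", ex7_p80),
    ("Ex 7, 10th percentile (g)", ex7_p10),
    ("Ex 7, x_0.35 (g)", ex7_x035),
    ("Ex 8, x_0.10 (slides/hour)", ex8),
    ("Ex 9, 20th percentile (days)", ex9),
    ("Ex 10, lower limit (mm)", ex10_low),
    ("Ex 10, upper limit (mm)", ex10_high),
]:
    print(f"{label}: {value:.1f}")
```

Each call is equivalent to finding the z-percentile first and then applying $x = \mu + z\sigma$, so the printed values should agree with the hand calculations to within the rounding of the z-values.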
If you are formulating statistical calculations in a Microsoft Excel spreadsheet, you should use the function NORMSINV() to compute percentiles for the standard normal distribution, and NORMINV() to compute percentiles for general normal distributions. Details on the usage of these functions are available via the paste function tool.

David W. Sabo (1999)