4 Continuous Random Variables and Probability Distributions
Copyright © Cengage Learning. All rights reserved.

4.6 Probability Plots

An investigator will often have obtained a numerical sample $x_1, x_2, \ldots, x_n$ and wish to know whether it is plausible that it came from a population distribution of some particular type (e.g., from a normal distribution). For one thing, many formal procedures from statistical inference are based on the assumption that the population distribution is of a specified type. The use of such a procedure is inappropriate if the actual underlying probability distribution differs greatly from the assumed type.

For example, the article “Toothpaste Detergents: A Potential Source of Oral Soft Tissue Damage” (Intl. J. of Dental Hygiene, 2008: 193–198) contains the following statement: “Because the sample number for each experiment (replication) was limited to three wells per treatment type, the data were assumed to be normally distributed.” As justification for this leap of faith, the authors wrote that “Descriptive statistics showed standard deviations that suggested a normal distribution to be highly likely.” Note: This argument is not very persuasive, since a standard deviation describes spread rather than distributional shape.

Additionally, understanding the underlying distribution can sometimes give insight into the physical mechanisms involved in generating the data. An effective way to check a distributional assumption is to construct what is called a probability plot.

The essence of such a plot is that if the distribution on which the plot is based is correct, the points in the plot should fall close to a straight line. If the actual distribution is quite different from the one used to construct the plot, the points will likely depart substantially from a linear pattern.
Sample Percentiles

The details involved in constructing probability plots differ a bit from source to source. The basis for our construction is a comparison between percentiles of the sample data and the corresponding percentiles of the distribution under consideration. We know that the (100p)th percentile of a continuous distribution with cdf $F(\cdot)$ is the number $\eta(p)$ that satisfies $F(\eta(p)) = p$. That is, $\eta(p)$ is the number on the measurement scale such that the area under the density curve to the left of $\eta(p)$ is p.

Thus the 50th percentile $\eta(.5)$ satisfies $F(\eta(.5)) = .5$, and the 90th percentile satisfies $F(\eta(.9)) = .9$. Consider as an example the standard normal distribution, for which we have denoted the cdf by $\Phi(\cdot)$. From Appendix Table A.3, we find the 20th percentile by locating the row and column in which .2000 (or a number as close to it as possible) appears inside the table. Since .2005 appears at the intersection of the $-.8$ row and the .04 column, the 20th percentile is approximately $-.84$. Similarly, the 25th percentile of the standard normal distribution is (using linear interpolation) approximately $-.675$.

Roughly speaking, sample percentiles are defined in the same way that percentiles of a population distribution are defined. The 50th sample percentile should separate the smallest 50% of the sample from the largest 50%, the 90th sample percentile should be such that 90% of the sample lies below that value and 10% lies above, and so on. Unfortunately, we run into problems when we actually try to compute the sample percentiles for a particular sample of n observations. If, for example, n = 10, we can split off 20% or 30% of the data, but there is no value that will split off exactly 23% of these ten observations.
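The table lookup described above can be cross-checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` stands in for Appendix Table A.3; the helper name `z_percentile` is introduced here only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_percentile(p: float) -> float:
    """Return eta(p), the (100p)th percentile of the standard normal distribution."""
    return STD_NORMAL.inv_cdf(p)


for p in (0.20, 0.25, 0.50, 0.90):
    eta = z_percentile(p)
    # The defining property is F(eta(p)) = p, so the cdf should give p back.
    print(f"p = {p:.2f}   eta(p) = {eta:+.4f}   Phi(eta(p)) = {STD_NORMAL.cdf(eta):.4f}")
```

The values printed for p = .20 and p = .25 should agree, after rounding, with the table-based approximations $-.84$ and $-.675$ quoted above.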
To proceed further, we need an operational definition of sample percentiles (this is one place where different people do slightly different things). Recall that when n is odd, the sample median or 50th sample percentile is the middle value in the ordered list, for example, the sixth-largest value when n = 11. This amounts to regarding the middle observation as being half in the lower half of the data and half in the upper half. Similarly, suppose n = 10. Then if we call the third-smallest value the 25th percentile, we are regarding that value as being half in the lower group (consisting of the two smallest observations) and half in the upper group (the seven largest observations).

This leads to the following general definition of sample percentiles.

Definition
Order the n sample observations from smallest to largest. Then the ith smallest observation in the list is taken to be the $[100(i - .5)/n]$th sample percentile.

Once the percentage values $100(i - .5)/n$ (i = 1, 2, …, n) have been calculated, sample percentiles corresponding to intermediate percentages can be obtained by linear interpolation. For example, if n = 10, the percentages corresponding to the ordered sample observations are $100(1 - .5)/10 = 5\%$, $100(2 - .5)/10 = 15\%$, 25%, …, and $100(10 - .5)/10 = 95\%$. The 10th percentile is then halfway between the 5th percentile (smallest sample observation) and the 15th percentile (second-smallest observation). For our purposes, such interpolation is not necessary because a probability plot will be based only on the percentages $100(i - .5)/n$ corresponding to the n sample observations.

A Probability Plot

Suppose now that for percentages $100(i - .5)/n$ (i = 1, …, n) the percentiles are determined for a specified population distribution whose plausibility is being investigated.
If the sample was actually selected from the specified distribution, the sample percentiles (ordered sample observations) should be reasonably close to the corresponding population distribution percentiles. That is, for i = 1, 2, …, n there should be reasonable agreement between the ith smallest sample observation and the $[100(i - .5)/n]$th percentile for the specified distribution. Let’s consider the (population percentile, sample percentile) pairs, that is, the pairs ($[100(i - .5)/n]$th percentile of the distribution, ith smallest sample observation) for i = 1, …, n. Each such pair can be plotted as a point on a two-dimensional coordinate system.

If the sample percentiles are close to the corresponding population distribution percentiles, the first number in each pair will be roughly equal to the second number. The plotted points will then fall close to a 45° line. Substantial deviations of the plotted points from a 45° line cast doubt on the assumption that the distribution under consideration is the correct one.

Example 29

The value of a certain physical constant is known to an experimenter. The experimenter makes n = 10 independent measurements of this value using a particular measurement device and records the resulting measurement errors (error = observed value – true value). These observations appear in the accompanying table.

Is it plausible that the random variable measurement error has a standard normal distribution? The needed standard normal (z) percentiles are also displayed in the table. Thus the points in the probability plot are $(-1.645, -1.91)$, $(-1.037, -1.25)$, …, and $(1.645, 1.56)$.

Figure 4.33 shows the resulting plot. Although the points deviate a bit from the 45° line, the predominant impression is that this line fits the points very well. The plot suggests that the standard normal distribution is a reasonable probability model for measurement error.
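The horizontal coordinates used in Example 29 can be reproduced without the table. This is a rough sketch under stated assumptions: the function names are hypothetical, the standard-library `NormalDist` replaces the printed z table, and no sample data are supplied because the table of measurement errors is not reproduced here.

```python
from statistics import NormalDist


def plotting_percentages(n: int) -> list[float]:
    """Percentages 100(i - .5)/n attached to the ordered observations, i = 1..n."""
    return [100 * (i - 0.5) / n for i in range(1, n + 1)]


def z_percentiles(n: int) -> list[float]:
    """Standard normal percentiles for the percentages 100(i - .5)/n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def probability_plot_points(sample: list[float]) -> list[tuple[float, float]]:
    """Pair each z percentile with the ith smallest observation of the sample."""
    ordered = sorted(sample)
    return list(zip(z_percentiles(len(ordered)), ordered))


# For n = 10 the percentages run 5%, 15%, ..., 95%.
print([round(pct, 1) for pct in plotting_percentages(10)])
print([round(z, 3) for z in z_percentiles(10)])
```

Passing the ten recorded measurement errors to `probability_plot_points` would give the points to compare against the 45° line. The second z percentile comes out near $-1.036$, which differs from the tabled $-1.037$ only in the last digit.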
Figure 4.33: Plots of pairs (z percentile, observed value) for the data of Example 29

Figure 4.34 shows a plot of pairs (z percentile, observation) for a second sample of ten observations. The 45° line gives a good fit to the middle part of the sample but not to the extremes. The plot has a well-defined S-shaped appearance. The two smallest sample observations are considerably larger than the corresponding z percentiles (the points on the far left of the plot are well above the 45° line). Similarly, the two largest sample observations are much smaller than the associated z percentiles. This plot indicates that the standard normal distribution would not be a plausible choice for the probability model that gave rise to these observed measurement errors.

Figure 4.34: Plots of pairs (z percentile, observed value) for the data of Example 29

An investigator is typically not interested in knowing just whether a specified probability distribution, such as the standard normal distribution (normal with $\mu = 0$ and $\sigma = 1$) or the exponential distribution with $\lambda = .1$, is a plausible model for the population distribution from which the sample was selected. Instead, the issue is whether some member of a family of probability distributions specifies a plausible model: the family of normal distributions, the family of exponential distributions, the family of Weibull distributions, and so on.

The values of the parameters of a distribution are usually not specified at the outset. If the family of Weibull distributions is under consideration as a model for lifetime data, are there any values of the parameters $\alpha$ and $\beta$ for which the corresponding Weibull distribution gives a good fit to the data? Fortunately, it is almost always the case that just one probability plot will suffice for assessing the plausibility of an entire family.
If the plot deviates substantially from a straight line, no member of the family is plausible. When the plot is quite straight, further work is necessary to estimate values of the parameters that yield the most reasonable distribution of the specified type.

Let’s focus on a plot for checking normality. Such a plot is useful in applied work because many formal statistical procedures give accurate inferences only when the population distribution is at least approximately normal. These procedures should generally not be used if the normal probability plot shows a very pronounced departure from linearity. The key to constructing an omnibus normal probability plot is the relationship between standard normal (z) percentiles and those for any other normal distribution:

percentile for a normal $(\mu, \sigma)$ distribution $= \mu + \sigma \cdot$ (corresponding z percentile)

Consider first the case $\mu = 0$. If each observation is exactly equal to the corresponding normal percentile for some value of $\sigma$, the pairs ($\sigma \cdot$ [z percentile], observation) fall on a 45° line, which has slope 1. This then implies that the (z percentile, observation) pairs fall on a line passing through (0, 0) (i.e., one with y-intercept 0) but having slope $\sigma$ rather than 1. The effect of a nonzero value of $\mu$ is simply to change the y-intercept from 0 to $\mu$.

A plot of the n pairs ($[100(i - .5)/n]$th z percentile, ith smallest observation) on a two-dimensional coordinate system is called a normal probability plot. If the sample observations are in fact drawn from a normal distribution with mean value $\mu$ and standard deviation $\sigma$, the points should fall close to a straight line with slope $\sigma$ and intercept $\mu$. Thus a plot for which the points fall close to some straight line suggests that the assumption of a normal population distribution is plausible.
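The slope and intercept claim can be checked on an idealized sample in which every observation equals the corresponding normal percentile exactly. The sketch below is illustrative only: the values $\mu = 5$ and $\sigma = 2$ are arbitrary assumptions, and `least_squares_line` is a hypothetical helper, not a method prescribed by the text.

```python
from statistics import NormalDist


def least_squares_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Assumed (hypothetical) parameter values and sample size.
mu, sigma, n = 5.0, 2.0, 20
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Horizontal coordinates: standard normal (z) percentiles.
z = [NormalDist().inv_cdf(p) for p in probs]
# Idealized "observations": exact percentiles of the normal(mu, sigma) distribution.
obs = [NormalDist(mu, sigma).inv_cdf(p) for p in probs]

slope, intercept = least_squares_line(z, obs)
print(f"slope = {slope:.4f} (sigma = {sigma}), intercept = {intercept:.4f} (mu = {mu})")
```

With real data the points only scatter about such a line, so the fitted slope and intercept give rough indications of $\sigma$ and $\mu$ rather than exact values.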
Example 30

The accompanying sample consisting of n = 20 observations on dielectric breakdown voltage of a piece of epoxy resin appeared in the article “Maximum Likelihood Estimation in the 3-Parameter Weibull Distribution” (IEEE Trans. on Dielectrics and Elec. Insul., 1996: 43–55). The values of $(i - .5)/n$ for which z percentiles are needed are $(1 - .5)/20 = .025$, $(2 - .5)/20 = .075$, …, and .975.

Figure 4.35 shows the resulting normal probability plot. The pattern in the plot is quite straight, indicating it is plausible that the population distribution of dielectric breakdown voltage is normal.

Figure 4.35: Normal probability plot for the dielectric breakdown voltage sample

There is an alternative version of a normal probability plot in which the z percentile axis is replaced by a nonlinear probability axis. The scaling on this axis is constructed so that plotted points should again fall close to a line when the sampled distribution is normal. Figure 4.36 shows such a plot from Minitab for the breakdown voltage data of Example 30.

Figure 4.36: Normal probability plot of the breakdown voltage data from Minitab

A nonnormal population distribution can often be placed in one of the following three categories:
1. It is symmetric and has “lighter tails” than does a normal distribution; that is, the density curve declines more rapidly out in the tails than does a normal curve.
2. It is symmetric and heavy-tailed compared to a normal distribution.
3. It is skewed.

A uniform distribution is light-tailed, since its density function drops to zero outside a finite interval. The density function $f(x) = 1/[\pi(1 + x^2)]$ for $-\infty < x < \infty$ is heavy-tailed, since $1/(1 + x^2)$ declines much less rapidly than does $e^{-x^2/2}$. Lognormal and Weibull distributions are among those that are skewed.
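To see how much more slowly the heavy-tailed density declines, the two density functions can be evaluated at a few points. This is only an illustrative sketch; the evaluation points and the helper name `heavy_tailed_pdf` are assumptions made here.

```python
import math
from statistics import NormalDist


def heavy_tailed_pdf(x: float) -> float:
    """Density f(x) = 1 / [pi (1 + x^2)] from the text."""
    return 1.0 / (math.pi * (1.0 + x * x))


std_normal = NormalDist()

for x in (0.0, 2.0, 4.0, 6.0):
    heavy = heavy_tailed_pdf(x)
    normal = std_normal.pdf(x)
    # Far from the center the heavy-tailed density is larger by orders of magnitude.
    print(f"x = {x:3.1f}   heavy-tailed = {heavy:.6f}   standard normal = {normal:.2e}")
```

Near the center the normal density is the larger of the two, but by x = 4 the ordering has reversed decisively, which is what "heavy-tailed" refers to.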
When the points in a normal probability plot do not adhere to a straight line, the pattern will frequently suggest that the population distribution is in a particular one of these three categories.

When the distribution from which the sample is selected is light-tailed, the largest and smallest observations are usually not as extreme as would be expected from a normal random sample. Visualize a straight line drawn through the middle part of the plot; points on the far right tend to be below the line (observed value < z percentile), whereas points on the left end of the plot tend to fall above the straight line (observed value > z percentile). The result is an S-shaped pattern of the type pictured in Figure 4.34.

A sample from a heavy-tailed distribution also tends to produce an S-shaped plot. However, in contrast to the light-tailed case, the left end of the plot curves downward (observed < z percentile), as shown in Figure 4.37(a).

Figure 4.37(a): Probability plots that suggest a nonnormal distribution: (a) a plot consistent with a heavy-tailed distribution

If the underlying distribution is positively skewed (a short left tail and a long right tail), the smallest sample observations will be larger than expected from a normal sample and so will the largest observations. In this case, points on both ends of the plot will fall above a straight line through the middle part, yielding a curved pattern, as illustrated in Figure 4.37(b).

Figure 4.37(b): (b) a plot consistent with a positively skewed distribution

A sample from a lognormal distribution will usually produce such a pattern. A plot of (z percentile, ln(x)) pairs should then resemble a straight line.
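The positively skewed pattern and the effect of the log transformation can be illustrated with an idealized lognormal sample. In the sketch below each "observation" is set equal to $e^{z}$ for the corresponding z percentile, which assumes ln(X) is exactly standard normal; real samples would scatter around this ideal. The "line through the middle part" is taken, as a further assumption, to be the line through the points nearest the lower and upper quartiles.

```python
import math
from statistics import NormalDist


def z_percentiles(n: int) -> list[float]:
    """Standard normal percentiles for the percentages 100(i - .5)/n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def residuals_from_middle_line(xs: list[float], ys: list[float]) -> list[float]:
    """Vertical distances from the line through the points nearest the two quartiles."""
    lo, hi = len(xs) // 4, (3 * len(xs)) // 4 - 1
    slope = (ys[hi] - ys[lo]) / (xs[hi] - xs[lo])
    intercept = ys[lo] - slope * xs[lo]
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


z = z_percentiles(20)
x = [math.exp(v) for v in z]  # idealized lognormal "observations"

raw = residuals_from_middle_line(z, x)
logged = residuals_from_middle_line(z, [math.log(v) for v in x])

# Both ends of the untransformed plot sit above the middle line.
print(f"raw: left end {raw[0]:+.3f}, right end {raw[-1]:+.3f}")
# After taking logs the points fall on the line (up to rounding error).
print(f"log: largest absolute residual {max(abs(r) for r in logged):.2e}")
```

Positive residuals at both ends of the untransformed plot correspond to the curved pattern of Figure 4.37(b), while the residuals for the (z percentile, ln(x)) pairs are essentially zero.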
Beyond Normality

Consider a family of probability distributions involving two parameters, $\theta_1$ and $\theta_2$, and let $F(x; \theta_1, \theta_2)$ denote the corresponding cdf’s. The family of normal distributions is one such family, with $\theta_1 = \mu$, $\theta_2 = \sigma$, and $F(x; \mu, \sigma) = \Phi[(x - \mu)/\sigma]$. Another example is the Weibull family, with $\theta_1 = \alpha$, $\theta_2 = \beta$, and

$F(x; \alpha, \beta) = 1 - e^{-(x/\beta)^{\alpha}}$

Still another family of this type is the gamma family, for which the cdf is an integral involving the incomplete gamma function that cannot be expressed in any simpler form.

The parameters $\theta_1$ and $\theta_2$ are said to be location and scale parameters, respectively, if $F(x; \theta_1, \theta_2)$ is a function of $(x - \theta_1)/\theta_2$. The parameters $\mu$ and $\sigma$ of the normal family are location and scale parameters, respectively. Changing $\mu$ shifts the location of the bell-shaped density curve to the right or left, and changing $\sigma$ amounts to stretching or compressing the measurement scale (the scale on the horizontal axis when the density function is graphed). Another example is given by the cdf

$F(x; \theta_1, \theta_2) = 1 - e^{-e^{(x - \theta_1)/\theta_2}}, \quad -\infty < x < \infty$

A random variable with this cdf is said to have an extreme value distribution. It is used in applications involving component lifetime and material strength. Although the form of the extreme value cdf might at first glance suggest that $\theta_1$ is the point of symmetry for the density function, and therefore the mean and median, this is not the case. Instead, $P(X \le \theta_1) = F(\theta_1; \theta_1, \theta_2) = 1 - e^{-1} = .632$, and the density function $f(x; \theta_1, \theta_2) = F'(x; \theta_1, \theta_2)$ is negatively skewed (a long lower tail).

Similarly, the scale parameter $\theta_2$ is not the standard deviation ($\mu = \theta_1 - .5772\theta_2$ and $\sigma = 1.283\theta_2$). However, changing the value of $\theta_1$ does change the location of the density curve, whereas a change in $\theta_2$ rescales the measurement axis. The parameter $\beta$ of the Weibull distribution is a scale parameter, but $\alpha$ is not a location parameter. A similar comment applies to the parameters $\alpha$ and $\beta$ of the gamma distribution.
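The statements about the extreme value distribution can be checked directly from its cdf. The following is a small sketch; the parameter values $\theta_1 = 3$ and $\theta_2 = 2$ are arbitrary assumptions, and both function names are introduced here for illustration.

```python
import math


def extreme_value_cdf(x: float, theta1: float = 0.0, theta2: float = 1.0) -> float:
    """F(x; theta1, theta2) = 1 - exp(-exp((x - theta1) / theta2))."""
    return 1.0 - math.exp(-math.exp((x - theta1) / theta2))


def extreme_value_percentile(p: float, theta1: float = 0.0, theta2: float = 1.0) -> float:
    """Solve F(eta; theta1, theta2) = p for eta."""
    return theta1 + theta2 * math.log(-math.log(1.0 - p))


theta1, theta2 = 3.0, 2.0  # hypothetical location and scale values

# P(X <= theta1) equals 1 - e^(-1), about .632, whatever the parameter values.
print(f"F(theta1) = {extreme_value_cdf(theta1, theta1, theta2):.3f}")

# The median lies below theta1, so theta1 is not the center of the distribution.
median = extreme_value_percentile(0.5, theta1, theta2)
print(f"median = {median:.3f}  (theta1 = {theta1})")
```

Because the cdf depends on x only through $(x - \theta_1)/\theta_2$, every percentile has the form $\theta_1 + \theta_2 \cdot$ (standard percentile), which is the same location and scale structure that underlies the normal probability plot.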
In the usual form, the density function for any member of either the gamma or Weibull family is positive for x > 0 and zero otherwise. A location parameter can be introduced as a third parameter $\gamma$ (we did this for the Weibull distribution) to shift the density function so that it is positive if $x > \gamma$ and zero otherwise.

When the family under consideration has only location and scale parameters, the issue of whether any member of the family is a plausible population distribution can be addressed via a single, easily constructed probability plot. One first obtains the percentiles of the standard distribution, the one with $\theta_1 = 0$ and $\theta_2 = 1$, for percentages $100(i - .5)/n$ (i = 1, …, n). The n (standardized percentile, observation) pairs give the points in the plot. This is exactly what we did to obtain an omnibus normal probability plot.

Somewhat surprisingly, this methodology can be applied to yield an omnibus Weibull probability plot. The key result is that if X has a Weibull distribution with shape parameter $\alpha$ and scale parameter $\beta$, then the transformed variable ln(X) has an extreme value distribution with location parameter $\theta_1 = \ln(\beta)$ and scale parameter $1/\alpha$. Thus a plot of the (extreme value standardized percentile, ln(x)) pairs showing a strong linear pattern provides support for choosing the Weibull distribution as a population model.

Example 31

The accompanying observations are on lifetime (in hours) of power apparatus insulation when thermal and electrical stress acceleration were fixed at particular values (“On the Estimation of Life of Power Apparatus Insulation Under Combined Electrical and Thermal Stress,” IEEE Trans. on Electrical Insulation, 1985: 70–78). A Weibull probability plot necessitates first computing the 5th, 15th, …, and 95th percentiles of the standard extreme value distribution.
The (100p)th percentile $\eta(p)$ satisfies

$p = F(\eta(p)) = 1 - e^{-e^{\eta(p)}}$

from which $\eta(p) = \ln[-\ln(1 - p)]$.

The pairs $(-2.97, 5.64)$, $(-1.82, 6.22)$, …, $(1.10, 7.67)$ are plotted as points in Figure 4.38. The straightness of the plot argues strongly for using the Weibull distribution as a model for insulation life, a conclusion also reached by the author of the cited article.

Figure 4.38: A Weibull probability plot of the insulation lifetime data
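The horizontal coordinates quoted in Example 31 follow directly from the percentile formula. The sketch below reproduces them for n = 10; the function names are hypothetical, and the lifetime observations are not included because the data table is not reproduced in this text.

```python
import math


def standard_extreme_value_percentile(p: float) -> float:
    """eta(p) = ln[-ln(1 - p)] for the standard extreme value distribution."""
    return math.log(-math.log(1.0 - p))


def weibull_plot_points(lifetimes: list[float]) -> list[tuple[float, float]]:
    """Pair each standard extreme value percentile with ln(ith smallest lifetime)."""
    ordered = sorted(lifetimes)
    n = len(ordered)
    return [
        (standard_extreme_value_percentile((i - 0.5) / n), math.log(x))
        for i, x in enumerate(ordered, start=1)
    ]


# Percentiles for the 5th, 15th, ..., 95th percentages (n = 10).
percentiles = [standard_extreme_value_percentile((i - 0.5) / 10) for i in range(1, 11)]
print([round(eta, 2) for eta in percentiles])
```

Since ln(X) has location $\ln(\beta)$ and scale $1/\alpha$, a straight pattern in the points returned by `weibull_plot_points` would have slope roughly $1/\alpha$ and intercept roughly $\ln(\beta)$.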