Module Four: Normal distribution and its applications to interlaboratory testing

When we conduct interlaboratory testing, we often observe continuous variables; for example, the amount of chloride in a water sample, the beta-carotene in a blood sample, and blood pressure are all continuous variables. When we construct a relative frequency histogram of such a variable, the distribution is very likely to be bell-shaped: a few values are small, a few are large, and most are around the average. This type of distribution is what we call a NORMAL distribution. For example, blood pressure, the beta-carotene in a blood sample, and the amount of chloride in a water sample mostly follow normal curves.

A histogram with an imposed normal curve for the systolic blood pressure of about 1900 individuals

[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20; μ = 115, σ = 14. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]

The imposed smooth curve looks like a bell shape. If the blood pressure follows a normal curve with mean 115 and s.d. 14, we use the notation X ~ N(μ, σ). For this case, X ~ N(115, 14).

An immediate question is: how can we detect whether the distribution indeed follows a normal curve?

Our interest may be to check whether the blood pressure follows a normal distribution, to find out what proportion of individuals have blood pressure that puts them at risk (150 mmHg or higher), or to identify extreme cases.

When and how do you use the normal distribution in real-world situations?

The normal curve describes the probability of occurrence in many real situations.
• Most statistical techniques, including the techniques used for analyzing interlaboratory testing data, assume that the response variable approximately follows a normal curve.
• These methods may not be valid if the response does not follow a normal distribution. It is, therefore, important to learn how to check whether a response variable follows a normal distribution.
For this reason, we need to learn some basic properties of a normal distribution and how to compute probabilities and percentiles for a normal distribution.

In this module, we will discuss:
• The use of the z-table and Minitab to compute probabilities and percentiles.
• Techniques for checking whether a response variable follows a normal distribution.

The normal probability distribution provides a good model for describing data that have mound-shaped frequency distributions.

The Normal Probability Distribution:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

where e ≈ 2.718 and π ≈ 3.142; μ and σ (σ > 0) are the parameters that represent the population mean and standard deviation.

We will use the notation X ~ N(μ, σ). This means X is distributed as normal with mean μ and standard deviation σ.

Some examples of normal random variables are: X = adult height, X = scores on a national test, X = gas price, X = blood pressure.

NOTE: X = salary of individuals who are 40 years or older and not yet retired does not follow a normal curve. It is a skewed-to-right distribution.

Properties of the Normal Distribution

This figure shows three such distributions with differing values of μ and σ.

[Figure: three normal curves with means μ1, μ2, μ3 and standard deviations σ1, σ2, σ3.]

The mean determines the center. In this case, μ1 < μ2 < μ3.
The standard deviation measures the variability. In this case, σ2 < σ1 < σ3.
Large values of σ reduce the height of the curve and increase the spread. Small values of σ increase the height of the curve and reduce the spread.

Some properties for X ~ N(μ, σ)

[Figure: normal density f(x) with the points μ - a, μ, and μ + a marked on the X-axis.]

Because the curve is symmetric about μ:
• P(μ - a < X < μ) = P(μ < X < μ + a)
• P(X < μ - a) = P(X > μ + a)
• Also: P(X > μ) = P(X < μ) = .5

Example: Every year, universities recruit students using their SAT scores. Based on previous information, we know that SAT scores follow a normal curve with mean 1000 and standard deviation 180. In the past, CMU admitted students with an SAT score of 1090 or higher.

Q1: What percent of high school students can receive CMU admission?
Q2: Suppose CMU decides to raise the SAT admission limit so as to admit only the top 20% of high school graduates. What should the new SAT admission limit be?
Q3: A student scored 1200 and claims he is in the top 10%. Is this claim correct?

Tabulated Areas of the Normal Probability Distribution

• How do you solve the SAT admission problem?

First, we need to rewrite the problem using notation we are familiar with. Let X = SAT score. Then, from the given information, we know X ~ N(1000, 180).
Q1 asks for P(X > 1090).
Q2 asks for a value of X, call it xo, the admission limit, such that P(X > xo) = .2.
Q3 asks us to compare P(X > 1200) with .1.

How do we solve these problems?

• The probability that a continuous random variable x assumes a value in the interval from a to b is the area under the probability density function between the points a and b. One can use a computer program such as Minitab, or a standardized Z-table.

The Standard Normal Random Variable:

The standardized normal random variable z is defined as z = (x - μ)/σ, or equivalently, x = μ + zσ.
The standard normal probability distribution has a mean of zero and a standard deviation of 1; that is, Z ~ N(0, 1).
The area under the standard normal curve between the mean z = 0 and a specified positive value of z, say zo, is the probability P(0 ≤ z ≤ zo). Some books tabulate P(0 ≤ z ≤ zo); others use other types of tables.

[Figure: standard normal curve with the area between 0 and zo shaded.]

Back to the SAT score problem: X ~ N(1000, 180)

[Figure: normal curve for X with the area P(X > 1090) shaded. The X-axis (SAT score) marks 1000 and 1090; the matching Z-axis uses Z = (x - 1000)/180, so 1000 maps to (1000 - 1000)/180 = 0 and 1090 maps to (1090 - 1000)/180 = 0.5.]

The idea is to transform X ~ N(μ, σ) to Z ~ N(0, 1) using z = (x - μ)/σ:
P(X > 1090) = P(Z > (1090 - 1000)/180) = P(Z > 0.5)
Now the Z-table can be applied: P(Z > 0.5) = .5 - P(0 < Z < 0.5) = .5 - .1915 = .3085, so about 31% of students meet the 1090 limit.

Example: Find P(0 < z < 1.63)
Solution:
1. Draw a normal curve and shade the area of interest.
2. Rewrite the question in a form to which the Z-table can be applied, that is, in the form P(0 < Z < zo). This example is already in that form, so using the Z-table we obtain P(0 < z < 1.63) = .4484.
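As a cross-check on the table-based calculations above, the same areas can be computed in software. The sketch below is not part of the original module: it assumes Python's standard-library `statistics.NormalDist` as a stand-in for Minitab or the Z-table, and all variable names are illustrative. It evaluates the three SAT questions and the P(0 < Z < 1.63) example.

```python
from statistics import NormalDist

# SAT scores as given in the example: X ~ N(mean 1000, s.d. 180)
sat = NormalDist(mu=1000, sigma=180)
z = NormalDist()  # standard normal, Z ~ N(0, 1)

# Q1: P(X > 1090), computed directly and via z = (x - mu) / sigma
p_q1 = 1 - sat.cdf(1090)
p_q1_via_z = 1 - z.cdf((1090 - 1000) / 180)

# Q2: admission limit xo with P(X > xo) = .2, i.e. the 80th percentile
x_q2 = sat.inv_cdf(0.80)

# Q3: P(X > 1200), to be compared with .1
p_q3 = 1 - sat.cdf(1200)

# Z-table style area between the mean and z = 1.63
p_table = z.cdf(1.63) - z.cdf(0)

print(f"Q1: P(X > 1090)     = {p_q1:.4f} (via Z: {p_q1_via_z:.4f})")
print(f"Q2: 80th percentile = {x_q2:.1f}")
print(f"Q3: P(X > 1200)     = {p_q3:.4f}")
print(f"P(0 < Z < 1.63)     = {p_table:.4f}")
```

Software works with an unrounded z value (about 0.8416 for the 80th percentile), so the limit it reports for Q2 can differ by a fraction of a point from one read off a two-decimal Z-table. For Q3, the computed tail area is about .13, which is larger than .1, so a score of 1200 does not fall in the top 10%.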
Some additional exercises: Find P(Z < 1.96); find P(-1.24 < Z < .68); find P(Z > -1.64).

Calculating Probabilities for a General Normal Random Variable, X:
1. Draw a normal curve for X and shade the area of interest.
2. Transform X to Z.
   - Standardize the interval of interest; write it as the equivalent interval in terms of z.
   - The probability of interest is the area that you find using the standard normal probability distribution.

Now, back to the SAT example. Do the following exercises, where the SAT score, X, follows a normal distribution with mean 1000 and s.d. 180; that is, X ~ N(1000, 180).
Find P(X < 800)
Find P(750 < X < 900)
Find P(1180 < X < 1360)

How about the question of determining the SAT admission score so that the top 20% will receive admission from CMU?

Answer: X ~ N(1000, 180). The problem is to find the admission score xo such that P(X > xo) = .2. Here we are looking for a score, not a probability, so we reverse the problem-solving procedure. A similar technique applies:
1. Draw a normal curve and shade the area of interest.
2. Transform from X to Z.
3. Rewrite the problem in terms of Z.
4. Solve for the standardized value zo by using the Z-table in reverse.
5. Transform zo back to xo by xo = μ + σ(zo).

To solve for the admission score xo such that P(X > xo) = .2, draw the normal curve, shade the area of interest, and transform to Z:
.2 = P(X > xo) = P(Z > zo) implies P(0 < Z < zo) = .5 - .2 = .3
This is a form to which the Z-table can be applied. Looking inside the table, find the probability closest to .3, which is .2995. By the Z-table, .2995 = P(0 < Z < .84). Therefore zo = .84, which is the standardized admission limit.
Solving for xo, we have xo = μ + σ(zo) = 1000 + (180)(.84) = 1151.2
The CMU SAT admission limit will be about 1151.2. (In an actual application, when setting the policy, we could use 1150 as the new admission standard.)

Hands-on activities:
Q-a: For the SAT example, X ~ N(1000, 180), suppose a university admits only the top 5%.
Find their admission limit.
Q-b: Find the 5th percentile of SAT scores.
Q-c: Find the Q3 SAT score (75th percentile).

Use Minitab to compute cumulative probabilities and percentiles for a normal distribution
1. Go to Calc, choose Probability Distributions, then select Normal.
2. In the dialog box, Probability density = f(x); Cumulative probability = P(X < a) for any given a; Inverse cumulative probability is the 100p-th percentile, xo, such that P(X < xo) = p. Choose the one you are computing.
3. Enter the mean and s.d. By default, the distribution is N(0, 1).
4. To compute a cumulative probability, you need to provide the 'a' values, which may be created and recorded in a column (e.g., C3), or simply provide the constant 'a'.
5. To compute an inverse cumulative probability, you need to provide the cumulative probabilities, which must be in (0, 1).

Methods for detecting discrepancies between the distribution of a response variable and the normal distribution

Consider the example of the blood pressure data. From the histogram and the normal curve imposed onto it using Minitab, we can see that the blood pressure, generally speaking, follows a normal curve. However, there seem to be a few unusually high blood pressures. The question is: how well does the blood pressure follow a normal curve? The imposed normal curve helps us quickly identify serious discrepancies from normal. However, if the discrepancy is not very serious, it is difficult to judge simply by observing the shape of a histogram.

We will discuss three ways of checking the normality of a response:
1. Imposing a normal curve onto the histogram,
2. Probability plot,
3. Numerical methods for testing the degree of departure from normal.

1. Imposing a normal curve onto a histogram for the blood pressure data of about 1900 young adults between 15 and 20 years old:

The normal curve indicates that there are a few large blood pressure measurements. In fact, the descriptive statistics show that the highest is 210, which is (210 - 115)/14 ≈ 6.8 s.d. above the average, far more than 2 s.d.
It suggests 210 is very rare. One should check immediately whether it is a typo.

[Figure: Histogram of Systolic Blood Pressure, with Normal Curve, for individuals aged 15-20. X-axis: Systolic Blood Pressure (60 to 210); Y-axis: Frequency (0 to 400).]

How to construct this plot using Minitab:
• Go to Stat, choose Basic Statistics, choose Display Descriptive Statistics.
• Enter the variable. Click on the 'Graphs' option.
• The Graphs dialog offers a variety of choices. One of them is Histogram with Normal Curve.

2. Normal Probability Plot:

It is a two-dimensional plot. The Y-axis is the estimated cumulative probability, computed by

(rank - 3/8) / (n + 1/4)

The X-axis is the original data in ascending order.

Diagnosis:
When the data follow a normal curve, the plotted points should follow a straight line.
When data are skewed-to-right, the plot would look like: [Figure: probability plot for skewed-to-right data]
When data are skewed-to-left, the plot would look like: [Figure: probability plot for skewed-to-left data]

[Figure: Normal Probability Plot for the Blood Pressure Data. X-axis: Systolic Blood Pressure; Y-axis: Probability (.001 to .999). Reference probabilities .10, .25, .50, .75, .90 are marked, with corresponding values 96.241, 104.929, 114.582, 124.235, 132.922. Average: 114.590; StDev: 14.0595; N: 1909. Anderson-Darling Normality Test: A-Squared: 11.502; P-Value: 0.000.]

The normal probability plot indicates that the systolic blood pressure does not follow a normal curve. The pattern also shows that the distribution is somewhat skewed-to-the-right.

3. Test statistic for testing whether the blood pressure follows a normal curve

Graphical methods are good at showing the pattern, and they give us a fairly clear picture that the data do not follow a normal distribution. Numerically, there are methods that test such a hypothesis. The test statistic is given in the same graph as the Normal Probability Plot. The Anderson-Darling Normality Test is presented here. The A-squared value is 11.5, and the corresponding p-value is .000.

Note: The p-value indicates how strong the evidence is that the distribution of blood pressure departs from normal.
The smaller the p-value, the stronger the evidence that the response variable does not follow a normal curve. A common cut-off point is 5%. In this case, the p-value is reported as .000 (that is, below .0005), which makes it clear that the distribution of systolic blood pressure is not normal.

How to construct a Normal Probability Plot and carry out the Anderson-Darling Normality Test?
1. Go to Stat, choose Basic Statistics, then select Normality Test.
2. In the dialog, enter the variable name.
3. Reference Probabilities allows us to provide a column of cumulative probabilities so that the normal probability plot will show the percentile for each given cumulative probability.

• Note: As we have observed, all three methods give similar results. Therefore, based on the 1909 cases, the systolic blood pressure of young adults aged 15 to 20 does not follow a normal distribution.
• Note: Once we find that the distribution is not normal, it is critical to carry out some further analysis:
  – Carefully check the data to see if there are any typos.
  – Examine the data using descriptive measures or other plots to identify extreme cases. (Details will be discussed in another module.)

Hands-on Activity: Use the above three methods to check the distribution of the Diastolic Blood Pressure data.

Actions to deal with extreme cases

For observational studies (such as surveys):
• The sample sizes are usually large, and it is often impossible to find the causes of the extreme data after the data are collected. Therefore, it is critical to collect background and environmental variables that may have an impact on the results.

For experimental studies, such as interlaboratory testing:
• It is important to look for possible causes of the extremes. The study is usually conducted in a controlled experimental environment, so it is more likely that we can find the causes of the extremes, or at least explain the possible causes.

Deletion of extremes vs.
Making transformation to normal

One must be careful about deleting extremes, especially when we cannot find any cause and the values are reasonable within the context of the study. This may be an indication that the distribution of the response is skewed. For situations such as this, an appropriate approach is to transform the data to be closer to normal.

Method for transforming a variable to normal

When the data show a skewed distribution, statistical methods such as Analysis of Variance may not be valid. One approach is to apply a mathematical transformation to the variable so that the transformed variable is closer to normal.

Some tips for variable transformation:
If a variable Y is skewed-to-right: ln(Y), log10(Y), or √Y will be closer to normal. (If there are zeros, first add .5 to each data value.)
If a variable Y is skewed-to-left: ln(1/Y), log10(1/Y), 1/Y, or Y^a with a > 1 will be closer to normal.

An example of transformation: The lifetimes of 50 light bulbs were tested by leaving them on continuously until they burned out. The data were recorded in months. Here are the histograms and the normal probability plots of the raw data, the ln-transformed data, and the square-root-transformed data:

[Figure: Histogram of Life Time. X-axis: Life Time; Y-axis: Frequency.]
[Figure: Histogram of Sqrt(Y). X-axis: Sqrt(Y); Y-axis: Frequency.]
[Figure: Histogram of Ln(Y). X-axis: Ln(Y); Y-axis: Frequency.]

The raw data are skewed-to-right. The square-root transformation works well. The ln transformation does not work well.
The normal probability plots and Anderson-Darling tests for the life-time data:

[Figure: Normal Probability Plot for the Life Time Data. X-axis: Life Time; Y-axis: Probability (.001 to .999). Average: 9.34544; StDev: 6.29248; N: 50. Anderson-Darling Normality Test: A-Squared: 0.906; P-Value: 0.019.]
[Figure: Normal Probability Plot for Sqrt(Y). X-axis: Sqrt(Y); Y-axis: Probability (.001 to .999). Average: 2.86016; StDev: 1.09027; N: 50. Anderson-Darling Normality Test: A-Squared: 0.430; P-Value: 0.297.]
[Figure: Normal Probability Plot for Ln(Y). X-axis: Ln(Y); Y-axis: Probability (.001 to .999). Average: 1.93005; StDev: 0.886131; N: 50. Anderson-Darling Normality Test: A-Squared: 1.071; P-Value: 0.007.]

As the normal probability plots and the normality test results indicate, Sqrt(Y) is approximately normal. The other two are not.

Hands-on Activity

Analyze the distribution of the variable GR36-Lab-Mean-1 in the TAPPI interlaboratory testing study, and determine an appropriate transformation to make the data closer to a normal distribution.
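For readers working outside Minitab, the normal probability plot coordinates and the comparison of transformations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the module's method: the data are synthetic (built so that their square root is exactly linear in the normal scores) and are not the light-bulb or TAPPI data; the function names are hypothetical; and the "straightness" measure is a simple correlation between the ordered data and the normal scores of the plotting positions (rank - 3/8)/(n + 1/4), used here in place of the Anderson-Darling test.

```python
import math
from statistics import NormalDist

def blom_positions(n):
    """Estimated cumulative probabilities (rank - 3/8) / (n + 1/4), ranks 1..n."""
    return [(rank - 3 / 8) / (n + 1 / 4) for rank in range(1, n + 1)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def straightness(data):
    """Correlation between the ordered data and the normal scores of the
    plotting positions; values near 1 mean the plot is close to a line."""
    ordered = sorted(data)
    scores = [NormalDist().inv_cdf(p) for p in blom_positions(len(ordered))]
    return pearson_r(ordered, scores)

# Hypothetical right-skewed sample (NOT the module's data): y = (3 + q)^2,
# where q are normal scores, so sqrt(y) is exactly normal-looking.
n = 50
q = [NormalDist().inv_cdf(p) for p in blom_positions(n)]
y = [(3 + qi) ** 2 for qi in q]

r_raw = straightness(y)
r_sqrt = straightness([math.sqrt(v) for v in y])
r_ln = straightness([math.log(v) for v in y])

print(f"raw Y:   r = {r_raw:.4f}")
print(f"sqrt(Y): r = {r_sqrt:.4f}")
print(f"ln(Y):   r = {r_ln:.4f}")
```

With this constructed sample the square-root scale gives the straightest plot, while the raw and ln scales bend away from the line. On real data the ranking of transformations has to be read from the plots and the normality test, as in the light-bulb example.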