https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-identify-the-distribution-of-your-datausing-minitab

How to Identify the Distribution of Your Data using Minitab

Minitab Blog Editor
08 March, 2012

I love all data, whether it’s normally distributed or downright bizarre. However, many people are more comfortable with the symmetric, bell-shaped curve of a normal distribution. A Gamma distribution, with its shape and scale parameters, is less intuitive than the familiar Normal distribution with its mean and standard deviation.

However, it's a fact of life that not all data follow the Normal distribution. Hey, a lot of stuff is just abnormal...er...non-normally distributed. How to understand and present the practical implications of your non-normal distribution in an easy-to-understand manner is an ongoing challenge for analysts.

This is particularly true for quality process improvement analysts, because a lot of their data is skewed (non-symmetric). The outputs of many processes have natural limits on one side of the distribution. Natural limits include things like purity, which can’t exceed 100%, or drill hole sizes, which cannot be smaller than the drill bit. These natural limits produce skewed distributions that extend away from the natural limit. So, non-normal data is actually typical in some areas.

Fear not; if you can shine the light on something and identify it, it becomes less scary. I will show you how to:

- Use Minitab Statistical Software to identify the distribution of your data (this post)
- Reap the benefits of the identification (next post)

To illustrate this process, I’ll look at the body fat percentage data from my previous post about using regression analysis for prediction. You can download this data here if you want to follow along.
Going with Raw Sample Data

We could simply plot the raw sample data in a histogram like this one:

This histogram does show us the shape of the sample data and it is a good starting point. We can see that this distribution is skewed to the right and probably non-normal. However, this graph only tells us about the data from this specific sample. You can’t make any inferences about the larger population from it.

What can be done to increase the usefulness of these data? First, identify the distribution that your data follow. Once you do that, you can learn things about the population, and you can create some cool-looking graphs!

How to Identify the Distribution of Your Data

To identify the distribution, we’ll go to Stat > Quality Tools > Individual Distribution Identification in Minitab. This handy tool allows you to easily compare how well your data fit 16 different distributions. It produces a lot of output both in the Session window and graphs, but don't be intimidated. Before we walk through the output, there are 3 measures you need to know.

- Anderson-Darling statistic (AD): Lower AD values indicate a better fit. However, to compare how well different distributions fit the data, you should assess the p-value, as described below.
- P-value: You want a high p-value. It’s generally valid to compare p-values between distributions and go with the highest. A low p-value (e.g., < 0.05) indicates that the data don’t follow that distribution. For some 3-parameter distributions, the p-value is impossible to calculate and is represented by asterisks.
- LRT P: For 3-parameter distributions only, a low value indicates that adding the third parameter is a significant improvement over the 2-parameter version. A higher value suggests that you may want to stick with the 2-parameter version.

So, for my data, I’ll fill out the main dialog like this:

Let’s dive into the output. We’ll start with the Goodness of Fit Test table below.
The very first line shows our data are definitely not normally distributed, because the p-value for Normal is less than 0.005! We'll skip the two transformations (Box-Cox and Johnson) because we want to identify the native distribution rather than transform it.

A good place to start is to skim through the p-values and look for the highest. The highest p-value is for 3-Parameter Weibull. For the 3-Parameter Weibull, the LRT P is significant (0.000), which means that the third parameter significantly improves the fit.

Given the higher p-value and significant LRT P value, we can pick the 3-Parameter Weibull distribution as the best fit for our data.

We identified this distribution by looking at the table in the Session window, but Minitab also creates a series of graphs that provide most of the same information along with probability plots. Probability plots are a great way to visually identify the distribution that your data follow. If the data points follow the straight line, the distribution fits. You can see 3-Parameter Weibull in the graph below, as well as three other distributions that don't fit the data.

Now we know what the distribution is, but what are the distribution's parameter values? For those, look at the next table down in the Minitab Session window output:

How Does Identifying the Distribution of Data Help with Analysis?

All right. Now we know that the body fat percentage data follow a 3-Parameter Weibull distribution with a shape of 1.85718, a scale of 14.07043, and a threshold of 16.06038.

At this point you may be wondering, "How does that help us?" The answer: with this information about the distribution, we can go beyond the raw sample data and make statistical inferences about the larger population. In my next post, I'll show you how to use powerful tools in Minitab to gain deeper insights into your research area and present your results more effectively.
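For readers who want to see what "lower AD values indicate a better fit" means outside Minitab, here is a minimal Python sketch of the Anderson-Darling statistic. It rests on several assumptions: the sample is synthetic (drawn from a 3-parameter Weibull with the parameters reported above, not the post's actual data set), the Weibull candidate uses those reported parameters directly rather than re-estimating them, and the helper names `weibull3_cdf` and `anderson_darling` are invented for illustration. It does not reproduce Minitab's adjusted statistics or p-values.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

# Parameters reported in the post for the body fat data (3-parameter Weibull).
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038


def weibull3_cdf(x, shape=SHAPE, scale=SCALE, threshold=THRESHOLD):
    """CDF of a 3-parameter Weibull; zero at or below the threshold."""
    if x <= threshold:
        return 0.0
    return 1.0 - math.exp(-(((x - threshold) / scale) ** shape))


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic A^2 of `sample` against a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # keeps log() away from 0 for extreme observations
    f = [min(max(cdf(x), eps), 1.0 - eps) for x in xs]
    total = sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


# SYNTHETIC sample (assumption): random.weibullvariate takes (scale, shape).
rng = random.Random(2012)
synthetic = [THRESHOLD + rng.weibullvariate(SCALE, SHAPE) for _ in range(2000)]

# Candidate 1: a Normal distribution with mean and sd estimated from the sample.
normal_fit = NormalDist(fmean(synthetic), stdev(synthetic))
ad_normal = anderson_darling(synthetic, normal_fit.cdf)

# Candidate 2: the 3-parameter Weibull with the parameters from the post.
ad_weibull = anderson_darling(synthetic, weibull3_cdf)

print(f"AD vs fitted Normal:       {ad_normal:.2f}")
print(f"AD vs 3-parameter Weibull: {ad_weibull:.2f}")
```

On skewed data like this, the Normal candidate should produce the larger AD value, which is the same ordering the Goodness of Fit Test table is used to read off.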
Related blog posts:
- Understanding and Using Discrete Distributions
- How to Test Your Discrete Distribution

How to Test Your Discrete Distribution

Minitab Blog Editor
13 December, 2012

In my last post we looked at different discrete distributions and how you can use them. This time, I’ll show you how to determine whether your data follow a specific discrete distribution. (Read here to see how to identify the distribution of your continuous data.)

Before we start testing discrete distributions, we need to distinguish between two general cases. Depending on the type of data, it is more important to either:

- Check the assumptions (binary data)
- Perform a goodness-of-fit test

Checking Assumptions for Distributions that Use Binary Data

For the distributions of binary data, you primarily need to determine whether your data satisfy the assumptions for that distribution. If you satisfy the assumptions, you can use the distribution to model the process. As an example, we’ll walk through the assumptions for the binomial distribution.

The binomial distribution has the following four assumptions:

1. Each trial has one of two outcomes: This can be pass or fail, accept or reject, etc.
2. Each trial is independent: A trial in an experiment is independent if the likelihood of each possible outcome does not change from trial to trial. For example, if you toss a coin 50 times, each coin toss is an independent trial, because the outcome of one toss (heads or tails) does not affect the likelihood of getting a heads or tails on the next toss.
3. The probability of an event is the same for each trial: The probability doesn’t change over time. Sometimes you can make this assumption because of the physical properties that are involved, such as flipping a coin.
Other times, you may want to use the P Chart to confirm this assumption. If the P Chart is in control, the probability is constant.
4. The number of trials is fixed: This assumption reflects your goal that you want to model how frequently the event occurs over a constant number of trials.

Generally, determining whether your data satisfy these assumptions relies on a close understanding of the process, data collection procedure, and your goals for the data. If you satisfy all of these assumptions, you can safely use the binomial distribution.

Besides the binomial distribution, there are three other distributions in Minitab statistical software that use binary data. They each have somewhat different assumptions than those listed for the binomial distribution.

Distribution: Key difference from binomial
Negative binomial: You want to model the number of trials to produce a fixed number of events
Geometric: You want to model the number of trials to produce the first event
Hypergeometric: The probability changes over time as you draw a sample from a small population without replacement

In short, if you have binary data, the choice of which binary distribution you should use depends on the population, the stability of the proportion, and what you want to do with the data. After you confirm the assumptions, you generally don’t need to perform a goodness-of-fit test.

Performing a Goodness-of-Fit Test

If you suspect that your data follow the Poisson distribution or a distribution based on categorical data, you should perform a goodness-of-fit test to determine whether your data follow a specific distribution. These tests compare the observed values to theoretical values to determine whether there is a significant difference. We’ll walk through some examples so you can see how easy it is to perform these tests. You can get the data here.
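The comparison of observed values to theoretical values can be sketched as a Pearson chi-square statistic. The Python below is an illustration under assumptions, not Minitab's implementation: the function name `chi_square_gof` is made up, and the tallies and test proportions are hypothetical numbers chosen to keep the arithmetic easy to follow. Instead of a p-value, it compares the statistic with 7.815, the standard upper 5% critical value of the chi-square distribution with 3 degrees of freedom.

```python
def chi_square_gof(observed, proportions):
    """Return (statistic, expected counts, per-category contributions)."""
    if len(observed) != len(proportions):
        raise ValueError("observed and proportions must have the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("test proportions must sum to 1")
    total = sum(observed)
    # Expected count for each category under the hypothesized proportions.
    expected = [total * p for p in proportions]
    # Each category contributes (observed - expected)^2 / expected.
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), expected, contributions


# HYPOTHETICAL tallies for four categories, tested against equal proportions.
observed = [30, 20, 25, 25]
proportions = [0.25, 0.25, 0.25, 0.25]

statistic, expected, contributions = chi_square_gof(observed, proportions)
degrees_of_freedom = len(observed) - 1

# Upper 5% critical value of chi-square with 3 degrees of freedom.
CRITICAL_5PCT_DF3 = 7.815
reject = statistic > CRITICAL_5PCT_DF3

print(f"chi-square = {statistic:.3f}, df = {degrees_of_freedom}, reject = {reject}")
```

The per-category contributions carry the same information as a chart of contributions to the chi-square value: the categories with the largest terms drive any significant difference.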
Poisson distribution

If you want to determine whether your data follow the Poisson distribution, Minitab has a test specifically for this distribution. To recap, the Poisson distribution describes a count of a characteristic (e.g., defects) over a constant observation space, such as the number of scratches on a windshield.

Accident count example

An insurance agent wants to monitor the number of accidents per month at a particular intersection. The agent records the number of accidents in the worksheet like this:

Each cell in the worksheet represents the number of accidents over one month.

In Minitab, use the Goodness-of-Fit Test for Poisson in the Stat > Basic Statistics menu. In the dialog box, in Variable, enter Accidents, and click OK.

The p-value is 0.470, which is greater than the common alpha level of 0.05. We therefore fail to reject the hypothesis that the data follow the Poisson distribution: the data are consistent with it and can be used with analyses that make this assumption. These analyses include the 1- and 2-sample Poisson rate analyses, the U Chart, and the Laney U’ Chart.

Categorical distributions

You can test distributions that are based on categorical data in Minitab using the Chi-Square Goodness-of-Fit Test, which is similar to the Poisson Goodness-of-Fit Test. However, because Minitab doesn’t know the distribution, you need to specify the test proportions yourself.

Car color example

We’ll run through an example using the proportions of car colors from my previous blog. In this example, the global proportions reported by PPG Industries are real, while the observations we “gathered” are for illustrative purposes only.

Suppose we want to determine whether the distribution of car colors in our state matches the global distribution. To do this, we have observers around the state record the colors of cars that were manufactured in 2012 and included in a random sample.
We tally up the colors and enter the global proportions in the worksheet like this:

The values in the OurState column represent the tally for each color in our sample. The global proportions are the values reported by PPG Industries.

In Minitab, go to Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable). In this dialog, enter OurState in Observed counts; enter Color in Category names; and, under Test, choose Specific proportions and enter Global Proportions. Click OK.

Minitab checks to see if the observed counts differ from the global distribution. A low p-value suggests that your data do not follow that distribution. In this case, the p-value is 0.012, which suggests that the distribution of car colors in our state does not match the global distribution.

You can compare the Observed and Expected columns in the table to see where the largest differences occur, or look at the default graphs below.

The graph above shows which colors statistically contribute the most to the significant difference. Gray and red contribute the most: together, more than half. However, the graph doesn’t show whether the observations are higher or lower than the expected value. The next graph addresses that.

Look at the "Gray" and "Red" bars in the graph above. The observed count of gray cars is greater than the expected count. Conversely, the observed count of red cars is less than the expected count.

Closing Thoughts

We’ve covered a variety of discrete data and how you should test it before you use a discrete distribution to model it. To decide how to proceed with your discrete data, you first need to determine what type of data you have, or suspect you have.
To quickly summarize:

- Binary data: check the assumptions
- Poisson data: use the Poisson Goodness-of-Fit Test
- Other categorical data: use the Chi-Square Goodness-of-Fit Test and specify the test proportions

The Graphical Benefits of Identifying the Distribution of Your Data

Minitab Blog Editor
22 March, 2012

In my previous post, we identified the distribution of the body fat data. Today, we're going to explore several benefits of knowing the distribution, with a special emphasis on creating informative graphs! After all, if you are not sure what a specific distribution with such and such parameters looks like, a graph gives you the picture!

Using the Distribution Information

So far, we have identified the distribution and the parameter values for the body fat data from 14-year-old girls.

3-Parameter Weibull Distribution:
- Shape = 1.85718
- Scale = 14.07043
- Threshold = 16.06038

How does that help us? What does this even look like? And where do important health ranges fall within this distribution? You can't tell just by looking at the parameter values. However, I'll answer all of these questions with just one cool graph!

It is always a good practice to know the distribution of your data before analyzing them. Certain analyses require certain distributions. For example, it could be a costly mistake to use an analysis that strictly requires a normal distribution with nonnormal data. However, I'm not going to focus on choosing alternative analyses. Instead, I'll focus on the graphs and what we can do just by knowing the distribution.
Because we have identified the best-fitting distribution, we are no longer limited to graphing the raw sample data, like we did with the histogram. We can now make inferences about the population. We can graph the best estimate for what the entire population looks like and calculate probabilities for values that fall in certain ranges. So, let’s do that.

Probability Distribution Plot

To answer all of our questions, we'll use Minitab's Probability Distribution Plot. I’m a huge fan of these plots. If you want to show your boss what an unusual distribution with inscrutable parameter names actually looks like, use this graph. You can highlight the effect of changing distributions and parameter values, show where target values fall in a distribution, and view the proportions that are associated with important regions. These simple plots clearly and easily communicate these advanced concepts to a non-statistical audience.

Probability Distribution Plots don't use any data. Instead, you specify the distribution and enter the parameter values. You can also specify regions of interest to you. We'll use the population parameters that we've already identified. For our region of interest, I found a Web site that recommends that girls between the ages of 14-19 should have a body fat percentage between 20%-24% for health reasons. That range sounds very tight to me, but let’s see where it falls in our population distribution for 14-year-old girls.

In Minitab, I’ll go to Graph > Probability Distribution Plot > View Probability and enter our distribution information in the main dialog like this:

Then, I’ll click the Shaded Area tab and fill it out like this:

After we click OK, Minitab displays the following graph:

All in one shot you can see both the shape of the distribution and how a range of interest fits within it. I’m no health expert, but I can see that the Web site's range for ideal body fat percentages doesn’t reflect where most of the girls fall.
Only 20% fall within the ideal range, and the range sits below the curve's peak. Already, we know something interesting is going on.

Probability Plots to Calculate Percentiles

Probability Plots have a similar name to Probability Distribution Plots. They are related, but Probability Plots are particularly good at determining whether the data fit a distribution (check, we did that already) and calculating percentiles based on that distribution. In general, the nth percentile has n% of the population below it, and (100-n)% of the population above it.

Percentiles are extra important for nonnormal distributions because you use them to find the center and spread of your distribution. Here's why. Intuitively we think of the mean and standard deviation as the center and spread for a normal distribution. Further, a good rule of thumb for normal distributions is that about two-thirds (roughly 68%) of the population falls symmetrically within 1 standard deviation of the mean. About 95% fall within 2 standard deviations.

However, none of this is true for non-symmetric distributions. The mean is not at the center and the general rule of thumb for the spread no longer works. However, once you identify your distribution you can calculate percentiles in order to find the center and spread of the population.

For example, if you want to find the middle value (median) and the range in which the middle 95% of a nonnormal population falls, calculate the 2.5th, 50th, and 97.5th percentiles (97.5 - 2.5 = 95). The median is the 50th percentile; half of the population are above the median and half are below.

We'll calculate the body fat percentages that correspond to the 2.5th, 50th, and 97.5th percentiles. Also, let's see what percentile corresponds to the upper limit of the supposed ideal body fat range: 24%. To do this, you'll need to open the data, which you can find here.

1. In Minitab, go to Graph > Probability Plot > Single.
2. In the main dialog, enter %Fat as the Graph Variable.
3. Click the Distribution button and choose 3-parameter Weibull. Click OK.
4. Click the Scale button, and uncheck the Adjust x-scale for threshold . . . checkbox. This produces a curved distribution fit line but allows the percentiles to be read straight off the graph.
5. Still under Scale, click the Percentile Lines tab, and fill it out as shown below to produce our desired percentiles.

Click OK in all dialogs. We get the following graph:

We already knew from before that these data follow this distribution, and the output reconfirms it. The data points follow the center line and the p-value in the legend is greater than 0.500, which is greater than any common alpha value. Hence, these data are consistent with the 3-parameter Weibull distribution.

In the graph, the data values are on the X-axis and the percentiles are on the Y-axis. For this population, the 50th percentile (the median) corresponds to a body fat percentage of 27.6%. 95% of the population should fall between the 2.5th and 97.5th percentiles, which correspond to 18.0% and 44.5% body fat. Because of the non-symmetric shape of the distribution, the median (27.6) is closer to the low value than the high value.

24% body fat corresponds to the 29th percentile. 24% is the top end of the ideal range recommended by the Web site, but it is a fairly low percentile for this population. Said another way, 71% of the population exceeds the upper limit of the range. Yikes!

Closing Thoughts

For the issue relating to the ideal body fat range, it's fairly clear that something is going on here. I'm not a health expert, so I don't know the answer. However, it appears that either the range is incorrect or a large majority (71%) of 14-year-old girls exceed the recommended range. Only 20% actually fall within the range.

However, with a few simple tools in Minitab, we have brought the implications of these data to life! Just as important, we can easily present these results to others in an easy-to-understand manner.
I hope after reading this you're more comfortable with nonnormal distributions and can see the advantages of identifying your data's distribution. I’ve shown how you can transcend your raw sample data and make useful inferences about the larger population that your data represent. You can safely embrace your nonnormal data!
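As a cross-check outside Minitab, the shaded-area probability and the percentiles discussed in this post can be recomputed from the closed-form 3-parameter Weibull CDF and its inverse. This is a minimal Python sketch, assuming only the shape, scale, and threshold reported earlier; the function names `weibull3_cdf` and `weibull3_percentile` are invented for illustration and are not Minitab functions.

```python
import math

# Parameters reported in the post for the body fat data.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038


def weibull3_cdf(x):
    """Fraction of the population at or below x."""
    if x <= THRESHOLD:
        return 0.0
    return 1.0 - math.exp(-(((x - THRESHOLD) / SCALE) ** SHAPE))


def weibull3_percentile(p):
    """Value below which a fraction p (0 < p < 1) of the population falls."""
    return THRESHOLD + SCALE * (-math.log(1.0 - p)) ** (1.0 / SHAPE)


# Proportion inside the recommended 20%-24% body fat range.
in_range = weibull3_cdf(24.0) - weibull3_cdf(20.0)

# Proportion above the upper limit of the range.
above_range = 1.0 - weibull3_cdf(24.0)

# Median and the bounds of the middle 95% of the population.
low, median, high = (weibull3_percentile(p) for p in (0.025, 0.5, 0.975))

print(f"P(20 <= fat <= 24) = {in_range:.3f}")
print(f"P(fat > 24)        = {above_range:.3f}")
print(f"2.5th, 50th, 97.5th percentiles = {low:.1f}, {median:.1f}, {high:.1f}")
```

When rounded, these values agree with the figures read off the Minitab graphs: about 20% inside the range, about 71% above it, and percentiles of roughly 18.0, 27.6, and 44.5.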