Uploaded by vasilydunaev

MINITAB STATISTICAL ANALYSIS OF DATA

https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-identify-the-distribution-of-your-data-using-minitab
How to Identify the Distribution of Your Data Using Minitab
Minitab Blog Editor 08 March, 2012



I love all data, whether it’s normally distributed or downright bizarre. However, many people are
more comfortable with the symmetric, bell-shaped curve of a normal distribution. It is not as intuitive
to understand a Gamma distribution, with its shape and scale parameters, as it is to understand the
familiar Normal distribution with its mean and standard deviation.
However, it's a fact of life that not all data follow the Normal distribution. Hey, a lot of stuff is just
abnormal...er...non-normally distributed. How to understand and present the practical implications of
your non-normal distribution in an easy-to-understand manner is an ongoing challenge for analysts.
This is particularly true for quality process improvement analysts, because a lot of their data is
skewed (non-symmetric). The output of many processes has a natural limit on one side of the
distribution. Natural limits include things like purity, which can’t exceed 100%, or drill hole sizes, which
cannot be smaller than the drill bit. These natural limits produce skewed distributions that extend
away from the natural limit. So, non-normal data is actually typical in some areas.
Fear not; shining a light on something and identifying it makes it less scary. I will show you
how to:


Use Minitab Statistical Software to identify the distribution of your data (this post)
Reap the benefits of the identification (next post)
To illustrate this process, I’ll look at the body fat percentage data from my previous post about
using regression analysis for prediction. You can download this data here if you want to follow
along.
Going with Raw Sample Data
We could simply plot the raw, sample data in a histogram like this one:
This histogram does show us the shape of the sample data and it is a good starting point. We can
see that this distribution is skewed to the right and probably non-normal. However, this graph only
tells us about the data from this specific example. You can’t make any inferences about the larger
population.
What can be done to increase the usefulness of these data? First, identify the distribution that your
data follow. Once you do that, you can learn things about the population—and you can create some
cool-looking graphs!
How to Identify the Distribution of Your Data
To identify the distribution, we’ll go to Stat > Quality Tools > Individual Distribution
Identification in Minitab. This handy tool allows you to easily compare how well your data fit 16
different distributions. It produces a lot of output both in the Session window and graphs, but don't be
intimidated. Before we walk through the output, there are 3 measures you need to know.
Anderson-Darling statistic (AD): Lower AD values indicate a better fit. However, to compare how
well different distributions fit the data, you should assess the p-value, as described below.
P-value: You want a high p-value. It’s generally valid to compare p-values between distributions and
go with the highest. A low p-value (e.g., < 0.05) indicates that the data don’t follow that distribution.
For some 3-parameter distributions, the p-value is impossible to calculate and is represented by
asterisks.
LRT P: For 3-parameter distributions only, a low value indicates that adding the third parameter is a
significant improvement over the 2-Parameter version. A higher value suggests that you may want to
stick with the 2-Parameter version.
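To make the AD statistic less abstract, here is a minimal Python sketch of how it can be computed for a Normal fit. The two samples are synthetic (they are not the body fat data), the function name anderson_darling_normal is hypothetical, and Minitab's own calculation may differ in details such as small-sample adjustments.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """Anderson-Darling statistic for a Normal fit with estimated mean and SD."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    total = 0.0
    for i, x in enumerate(xs, start=1):
        f_low = fitted.cdf(x)              # fitted CDF at the i-th smallest value
        f_high = fitted.cdf(xs[n - i])     # fitted CDF at the i-th largest value
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n

# Synthetic sample 1: evenly spaced Normal quantiles, a nearly perfect bell shape.
n = 50
bell = [NormalDist(30, 6).inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
# Synthetic sample 2: exponentiating the standardized values skews them to the right.
skewed = [math.exp((x - 30) / 6) for x in bell]

ad_bell = anderson_darling_normal(bell)
ad_skewed = anderson_darling_normal(skewed)
print(f"AD, bell-shaped sample:  {ad_bell:.3f}")
print(f"AD, right-skewed sample: {ad_skewed:.3f}")
```

The bell-shaped sample produces the smaller AD value, which is what "lower AD indicates a better fit" looks like in practice.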
So, for my data, I’ll fill out the main dialog like this:
Let’s dive into the output. We’ll start with the Goodness of Fit Test table below.
The very first line shows our data are definitely not normally distributed, because the p-value for
Normal is less than 0.005!
We'll skip the two transformations (Box-Cox and Johnson) because we want to identify the native
distribution rather than transform it.
A good place to start is to skim through the p-values and look for the highest. The highest p-value is
for 3-Parameter Weibull. For the 3-Parameter Weibull, the LRT P is significant (0.000), which means
that the third parameter significantly improves the fit.
Given the highest p-value and significant LRT P value, we can pick the 3-Parameter Weibull
distribution as the best fit for our data. We identified this distribution by looking at the table in the
Session window, but Minitab also creates a series of graphs that provide most of the same
information along with probability plots.
Probability plots are a great way to visually identify the distribution that your data follow. If the data
points follow the straight line, the distribution fits. You can see 3-Parameter Weibull in the graph
below, as well as three other distributions that don't fit the data.
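If you are curious how a probability plot decides what "straight" means, the idea can be sketched in a few lines of Python. For a Weibull distribution, plotting ln(x - threshold) against ln(-ln(1 - p)) gives a straight line when the fit is right. The parameter values, the sample, and the function names below are all hypothetical, and the median-rank plotting position is one common choice rather than a statement about Minitab's internals.

```python
import math

def weibull_plot_points(sample, threshold):
    """Return (x, y) pairs that fall on a straight line if the data are Weibull."""
    xs = sorted(sample)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.3) / (n + 0.4)          # median-rank plotting position
        points.append((math.log(x - threshold), math.log(-math.log(1.0 - p))))
    return points

def correlation(points):
    """Pearson correlation between the two plot coordinates."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical parameters and a synthetic sample built from exact Weibull quantiles.
shape, scale, threshold = 2.0, 10.0, 15.0
n = 30
sample = [
    threshold + scale * (-math.log(1.0 - (i - 0.3) / (n + 0.4))) ** (1.0 / shape)
    for i in range(1, n + 1)
]

r_right = correlation(weibull_plot_points(sample, threshold))  # correct threshold
r_wrong = correlation(weibull_plot_points(sample, 0.0))        # threshold ignored
print(f"Correlation with the correct threshold: {r_right:.4f}")
print(f"Correlation with threshold set to 0:    {r_wrong:.4f}")
```

With the correct threshold the points line up almost perfectly; ignoring the threshold bends the line, which is the curvature you would see for a distribution that doesn't fit.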
Now we know what the distribution is—but what are the distribution's parameter values? For those,
look at the next table down in the Minitab Session window output:
How Does Identifying the Distribution of Data Help with Analysis?
All right. Now we know that the body fat percentage data follow a 3-Parameter Weibull distribution
with a shape of 1.85718, a scale of 14.07043, and a threshold of 16.06038.
At this point you may be wondering, "How does that help us?" The answer: with this information
about the distribution, we can go beyond the raw sample data and make statistical inferences about
the larger population.
In my next post, I'll show you how to use powerful tools in Minitab to gain deeper insights into your
research area and present your results more effectively.
Related blog posts:


Understanding and Using Discrete Distributions
How to Test Your Discrete Distribution
How to Test Your Discrete Distribution
Minitab Blog Editor 13 December, 2012



In my last post we looked at different discrete distributions and how you can use them. This
time, I’ll show you how to determine whether your data follow a specific discrete distribution.
(Read here to see how to identify the distribution of your continuous data.)
Before we start testing discrete distributions, we need to distinguish between two general cases.
Depending on the type of data you have, it is more important to either:


Check the assumptions (binary data)
Perform a goodness-of-fit test
Checking Assumptions for Distributions that Use Binary Data
For the distributions of binary data, you primarily need to determine whether your data satisfy
the assumptions for that distribution. If you satisfy the assumptions, you can use the distribution
to model the process.
As an example, we’ll walk through the assumptions for the binomial distribution. The binomial
distribution has the following four assumptions:
1. Each trial has one of two outcomes: This can be pass or fail, accept or reject, etc.
2. Each trial is independent: Trials are independent if the outcome of one trial does not affect the
likelihood of the outcomes of any other trial. For example, if you toss a coin 50 times,
each coin toss is an independent trial, because the outcome of one toss (heads or tails) does not
affect the likelihood of getting a heads or tails on the next toss.
3. The probability of an event is the same for each trial: The probability doesn’t change over time.
Sometimes you can make this assumption because of the physical properties that are involved,
such as flipping a coin. Other times, you may want to use the P Chart to confirm this
assumption. If the P Chart is in control, the probability is constant.
4. The number of trials is fixed: This assumption reflects your goal of modeling how
frequently the event occurs over a constant number of trials.
Generally, determining whether your data satisfy these assumptions relies on a close
understanding of the process, data collection procedure, and your goals for the data. If you
satisfy all of these assumptions, you can safely use the binomial distribution.
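Assumption 3 mentions the P Chart as a way to check that the probability stays constant. As a rough illustration of what that chart calculates, here is a minimal Python sketch. The reject counts and the subgroup size are hypothetical, the function name p_chart is my own label, and only the basic "point beyond 3 sigma" check is shown; a full P Chart in Minitab can apply additional tests.

```python
import math

def p_chart(defectives, subgroup_size):
    """Return the center line, control limits, and any out-of-control subgroups."""
    proportions = [d / subgroup_size for d in defectives]
    p_bar = sum(defectives) / (subgroup_size * len(defectives))
    sigma = math.sqrt(p_bar * (1 - p_bar) / subgroup_size)
    lcl = max(0.0, p_bar - 3 * sigma)      # a proportion cannot go below 0
    ucl = min(1.0, p_bar + 3 * sigma)      # or above 1
    flagged = [i for i, p in enumerate(proportions, start=1) if p < lcl or p > ucl]
    return p_bar, lcl, ucl, flagged

# Hypothetical data: rejected units in 12 subgroups of 100 units each.
rejects = [4, 6, 5, 3, 7, 5, 4, 6, 5, 8, 4, 3]
p_bar, lcl, ucl, flagged = p_chart(rejects, 100)

print(f"Center line (p-bar): {p_bar:.3f}")
print(f"Control limits: {lcl:.3f} to {ucl:.3f}")
print("Subgroups outside the limits:", flagged or "none")
```

When no subgroup falls outside the limits, the proportion looks stable enough to treat as constant for the binomial distribution.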
Besides the binomial distribution, there are three other distributions in Minitab statistical
software that use binary data. They each have somewhat different assumptions than those listed
for the binomial distribution.
Distribution        Key difference from binomial
Negative binomial   You want to model the number of trials to produce a fixed number of events
Geometric           You want to model the number of trials to produce the first event
Hypergeometric      The probability changes over time as you draw a sample from a small population without replacement
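To see how these four distributions differ in practice, here is a short Python sketch of their probability formulas, using the "number of trials" definitions from the table above. The function names and the example numbers are hypothetical, apart from the 50 coin tosses borrowed from the independence example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k events in a fixed number of n trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(t, p):
    """P(the first event occurs on trial t)."""
    return (1 - p) ** (t - 1) * p

def negative_binomial_pmf(t, r, p):
    """P(the r-th event occurs on trial t)."""
    return comb(t - 1, r - 1) * p**r * (1 - p) ** (t - r)

def hypergeometric_pmf(k, population, events, sample):
    """P(k events in a sample drawn without replacement from a small population)."""
    return comb(events, k) * comb(population - events, sample - k) / comb(population, sample)

# Fair coin: exactly 25 heads in 50 tosses.
print(f"Binomial:          {binomial_pmf(25, 50, 0.5):.4f}")
# Fair coin: first head on the 3rd toss.
print(f"Geometric:         {geometric_pmf(3, 0.5):.4f}")
# Fair coin: 2nd head arrives on the 5th toss.
print(f"Negative binomial: {negative_binomial_pmf(5, 2, 0.5):.4f}")
# Hypothetical lot of 10 parts with 3 defectives: exactly 1 defective in a sample of 2.
print(f"Hypergeometric:    {hypergeometric_pmf(1, 10, 3, 2):.4f}")
```

Note that the geometric distribution is just the negative binomial with a single required event.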
In short, if you have binary data, the choice of which binary distribution you should use depends
on the population, the stability of the proportion, and what you want to do with the data. After
you confirm the assumptions, you generally don’t need to perform a goodness-of-fit test.
Performing a Goodness-of-Fit Test
If you suspect that your data follow the Poisson distribution or a distribution based on categorical
data, you should perform a goodness-of-fit test to determine whether your data follow a specific
distribution. These tests compare the observed values to theoretical values to determine whether
there is a significant difference. We’ll walk through some examples so you can see how easy it is
to perform these tests. You can get the data here.
Poisson distribution
If you want to determine whether your data follow the Poisson distribution, Minitab has a test
specifically for this distribution. To recap, the Poisson distribution describes a count of a
characteristic (e.g., defects) over a constant observation space, such as the number of scratches
on a windshield.
Accident count example
An insurance agent wants to monitor the number of accidents per month at a particular
intersection. The agent records the number of accidents in the worksheet like this:
Each cell in the worksheet represents the number of accidents over one month.
In Minitab, use the Goodness-of-Fit Test for Poisson in the Stat > Basic Statistics menu. In the
dialog box, in Variable, enter Accidents, and click OK.
The p-value is 0.470, which is greater than the common alpha level of 0.05. This result gives us no
evidence that these data depart from the Poisson distribution, so they can be used with analyses that
make this assumption. These analyses include the 1- and 2-sample Poisson rate analyses, the U Chart, and
the Laney U’ Chart.
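For readers who want to see the mechanics behind a Poisson goodness-of-fit test, here is a minimal Python sketch of the usual chi-square approach: estimate the mean from the data, compute the expected count for each category, and compare observed to expected. The accident counts are hypothetical (they are not the worksheet data), the function names are my own, and Minitab's rules for pooling categories may differ from the simple "pool the upper tail" rule used here.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper-tail chi-square probability (series for the incomplete gamma function)."""
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= h / (a + n)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def poisson_gof(counts, top_category):
    """Chi-square goodness-of-fit test against a Poisson with the sample mean."""
    n = len(counts)
    mean = sum(counts) / n
    # Pool every count at or above top_category into one ">= top" category.
    observed = Counter(min(c, top_category) for c in counts)
    chi_sq, cumulative = 0.0, 0.0
    for k in range(top_category + 1):
        if k < top_category:
            prob = math.exp(-mean) * mean**k / math.factorial(k)
            cumulative += prob
        else:
            prob = 1.0 - cumulative        # probability of the pooled upper tail
        expected = n * prob
        chi_sq += (observed[k] - expected) ** 2 / expected
    df = top_category - 1  # (top_category + 1) categories, minus 1, minus 1 estimated mean
    return mean, chi_sq, df, chi2_sf(chi_sq, df)

# Hypothetical monthly accident counts for 30 months.
accidents = [0, 1, 2, 1, 3, 2, 1, 0, 2, 4, 1, 2, 3, 1, 0,
             2, 1, 3, 2, 1, 5, 2, 1, 0, 3, 2, 1, 2, 4, 1]
mean, chi_sq, df, p_value = poisson_gof(accidents, top_category=4)
print(f"Mean = {mean:.2f}, chi-square = {chi_sq:.3f}, DF = {df}, p-value = {p_value:.3f}")
```

As in the Minitab output, a p-value above your alpha level means the test found no significant difference between the observed counts and what a Poisson distribution would produce.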
Categorical distributions
You can test distributions that are based on categorical data in Minitab using the Chi-Square
Goodness-of-Fit Test, which is similar to the Poisson Goodness-of-Fit Test. However, because
Minitab doesn’t know the distribution, you need to specify the test proportions yourself.
Car color example
We’ll run through an example using the proportions of car colors from my previous blog. In this
example, the global proportions reported by PPG Industries are real, while the observations we
“gathered” are for illustrative purposes only.
Suppose we want to determine whether the distribution of car colors in our state matches the global
distribution. To do this, we have observers around the state record the colors of cars that were
manufactured in 2012 and included in a random sample. We tally up the colors and enter the
global proportions in the worksheet like this:
The values in the OurState column represent the tally for each color in our sample. The global
proportions are the values reported by PPG Industries.
In Minitab, go to Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable). In this
dialog, enter OurState in Observed counts; enter Color in Category names; and, under Test,
choose Specific proportions and enter Global Proportions. Click OK.
Minitab checks to see if the observed counts differ from the global distribution. A low p-value
suggests that your data do not follow that distribution. In this case, the p-value is 0.012, which
suggests that the distribution of car colors in our state does not match the global distribution.
You can compare the Observed and Expected columns in the table to see where the largest
differences occur, or look at the default graphs below.
The graph above shows which colors statistically contribute the most to the significant
difference. Gray and red contribute the most, more than half. However, the graph doesn’t show
whether the observations are higher or lower than the expected value. The next graph addresses
that.
Look at the "Gray" and "Red" bars in the graph above. The observed count of gray cars is greater
than the expected count. Conversely, the observed count of red cars is less than the expected
count.
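The calculation behind this test and both graphs is short enough to sketch in Python. Everything in the example is hypothetical: the color categories, the test proportions, and the tallies are made up for illustration and are not the PPG Industries figures or the worksheet data, so the output will not reproduce the p-value of 0.012 above. The function names are my own.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability (series for the incomplete gamma function)."""
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= h / (a + n)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit test against user-specified test proportions."""
    n = sum(observed.values())
    rows = {}
    for category, count in observed.items():
        expected = n * proportions[category]
        contribution = (count - expected) ** 2 / expected
        rows[category] = (count, expected, contribution)
    chi_sq = sum(contribution for _, _, contribution in rows.values())
    df = len(observed) - 1
    return chi_sq, df, chi2_sf(chi_sq, df), rows

# Hypothetical test proportions and tallies (NOT the real PPG or worksheet values).
test_proportions = {"White": 0.25, "Black": 0.20, "Silver": 0.20,
                    "Gray": 0.15, "Red": 0.10, "Other": 0.10}
our_state = {"White": 48, "Black": 38, "Silver": 37,
             "Gray": 46, "Red": 10, "Other": 21}

chi_sq, df, p_value, rows = chi_square_gof(our_state, test_proportions)
print(f"Chi-square = {chi_sq:.3f}, DF = {df}, p-value = {p_value:.3f}")
for category, (count, expected, contribution) in rows.items():
    direction = "above" if count > expected else "below"
    print(f"{category:<7} observed {count:>3}, expected {expected:5.1f} "
          f"({direction}), contribution {contribution:.3f}")
```

The contribution column corresponds to the first graph (which categories drive the difference), and the observed versus expected comparison corresponds to the second (in which direction they differ).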
Closing Thoughts
We’ve covered a variety of discrete data and how you should test it before you use a discrete
distribution to model it. To determine how to proceed with your discrete data, you first
need to determine what type it is, or what type you suspect it is. To quickly summarize:



Binary data: check the assumptions
Poisson data: use the Poisson Goodness-of-Fit Test
Other categorical data: use the Chi-Square Goodness-of-Fit Test and specify the test proportions
The Graphical Benefits of Identifying the Distribution of Your Data
Minitab Blog Editor 22 March, 2012



In my previous post, we identified the distribution of the body fat data. Today, we're going to explore
several benefits of knowing the distribution, with a special emphasis on creating informative graphs!
After all, if you are not sure what a specific distribution with such and such parameters looks like, a
graph gives you the picture!
Using the Distribution Information
So far, we have identified the distribution and the parameter values for the body fat data from 14-year-old girls.
3-Parameter Weibull Distribution:



Shape = 1.85718
Scale = 14.07043
Threshold = 16.06038
How does that help us? What does this even look like? And where do important health ranges fall
within this distribution? You can't tell just by looking at the parameter values. However, I'll answer all
of these questions with just one cool graph!
It is always a good practice to know the distribution of your data before analyzing them. Certain
analyses require certain distributions. For example, it could be a costly mistake to use an analysis
that strictly requires a normal distribution with nonnormal data. However, I'm not going to focus on
choosing alternative analyses. Instead, I'll focus on the graphs and what we can do just by knowing
the distribution.
Because we have identified the best-fitting distribution, we are no longer limited to graphing the raw
sample data, like we did with the histogram. We can now make inferences about the population. We
can graph the best estimate for what the entire population looks like and calculate probabilities for
values that fall in certain ranges. So, let’s do that.
Probability Distribution Plot
To answer all of our questions, we'll use Minitab's Probability Distribution Plot. I’m a huge fan of
these plots. If you want to show your boss what an unusual distribution with inscrutable parameter
names actually looks like, use this graph. You can highlight the effect of changing distributions and
parameter values, show where target values fall in a distribution, and view the proportions that are
associated with important regions. These simple plots clearly and easily communicate these
advanced concepts to a non-statistical audience.
Probability Distribution Plots don't use any data. Instead, you specify the distribution and enter the
parameter values. You can also specify regions of interest to you.
We'll use the population parameters that we've already identified. For our region of interest, I found
a Web site that recommends that girls aged 14 to 19 should have a body fat percentage
between 20% and 24% for health reasons. That range sounds very tight to me, but let’s see where it falls
in our population distribution for 14-year-old girls.
In Minitab, I’ll go to Graph > Probability Distribution Plot > View Probability and enter our
distribution information in the main dialog like this:
Then, I’ll click the Shaded Area tab and fill it out like this:
After we click OK, Minitab displays the following graph:
All in one shot you can see both the shape of the distribution and how a range of interest fits within
it. I’m no health expert, but I can see that the Web site's range for ideal body fat percentages doesn’t
reflect where most of the girls fall. Only 20% fall within the ideal range, and that range lies below the
curve's peak. Already, we know something interesting is going on.
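The shaded area in this kind of plot is simply a difference of two cumulative probabilities, so it can be checked by hand. Here is a minimal Python sketch that uses the three parameter values from this post and the standard 3-parameter Weibull formulas; the function names are my own, and this is a check on the arithmetic rather than a description of how Minitab draws the plot.

```python
import math

# Parameter values reported for the body fat data in this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

def weibull_cdf(x):
    """Proportion of the fitted population at or below x."""
    if x <= THRESHOLD:
        return 0.0
    return 1.0 - math.exp(-(((x - THRESHOLD) / SCALE) ** SHAPE))

def weibull_pdf(x):
    """Height of the fitted density curve at x."""
    if x <= THRESHOLD:
        return 0.0
    z = (x - THRESHOLD) / SCALE
    return (SHAPE / SCALE) * z ** (SHAPE - 1) * math.exp(-(z ** SHAPE))

# Shaded area: proportion of the population between 20% and 24% body fat.
in_range = weibull_cdf(24) - weibull_cdf(20)
# Peak (mode) of the density curve, valid because the shape is greater than 1.
peak = THRESHOLD + SCALE * ((SHAPE - 1) / SHAPE) ** (1 / SHAPE)

print(f"Proportion between 20% and 24% body fat: {in_range:.1%}")
print(f"Peak of the density curve: {peak:.1f}% body fat")
```

The proportion comes out at about 20%, and the peak sits above 24%, which agrees with what the graph shows.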
Probability Plots to Calculate Percentiles
Probability Plots have a similar name to Probability Distribution Plots. They are related, but
Probability Plots are particularly good at determining whether the data fit a distribution (check, we did
that already) and calculating percentiles based on that distribution. In general, the nth percentile has
n% of the population below it, and (100-n)% of the population above it.
Percentiles are extra important for nonnormal distributions because you use them to find the center
and spread of your distribution. Here's why.
Intuitively we think of the mean and standard deviation as the center and spread for a normal
distribution. Further, a good rule of thumb for normal distributions is that about two-thirds (68%) of the
population falls within 1 standard deviation of the mean, symmetrically on either side. About 95% fall
within 2 standard deviations.
However, none of this is true for non-symmetric distributions. The mean is not at the center and the
general rule of thumb for the spread no longer works. However, once you identify your distribution
you can calculate percentiles in order to find the center and spread of the population.
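Both halves of this argument are easy to verify numerically. The Python sketch below first checks the rule of thumb for a normal distribution, then compares the mean and the median of the fitted 3-parameter Weibull distribution using the parameter values from this post and the standard Weibull formulas. The variable names are my own.

```python
import math
from statistics import NormalDist

# Rule of thumb for a normal distribution.
z = NormalDist()  # standard normal: mean 0, standard deviation 1
within_1_sd = z.cdf(1) - z.cdf(-1)
within_2_sd = z.cdf(2) - z.cdf(-2)
print(f"Normal, within 1 SD: {within_1_sd:.1%}")   # 68.3%
print(f"Normal, within 2 SD: {within_2_sd:.1%}")   # 95.4%

# Fitted 3-parameter Weibull distribution for the body fat data.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038
weibull_mean = THRESHOLD + SCALE * math.gamma(1 + 1 / SHAPE)
weibull_median = THRESHOLD + SCALE * math.log(2) ** (1 / SHAPE)
print(f"Weibull mean:   {weibull_mean:.1f}")
print(f"Weibull median: {weibull_median:.1f}")
```

For the right-skewed Weibull fit, the mean lands above the median, so the mean no longer marks the middle of the population.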
For example, if you want to find the middle value (median) and the range in which the middle 95% of
a nonnormal population falls, calculate the 2.5th, 50th, and 97.5th percentiles (97.5 - 2.5 = 95). The
median is the 50th percentile; half of the population are above the median and half are below.
We'll calculate the body fat percentages that correspond to the 2.5th, 50th, and 97.5th percentiles. Also,
let's see what percentile corresponds to the upper limit of the supposed ideal body fat range: 24%.
To do this, you'll need to open the data, which you can find here.
1. In Minitab, go to Graph > Probability Plot > Single.
2. In the main dialog, enter %Fat as the Graph Variable.
3. Click the Distribution button and choose 3-parameter Weibull. Click OK.
4. Click the Scale button, and uncheck the Adjust x-scale for threshold . . . checkbox. This
produces a curved distribution fit line but allows the percentiles to be read straight off the
graph.
5. Still under Scale, click the Percentile Lines tab, and fill it out as shown below to produce our
desired percentiles. Click OK in all dialogs.
We get the following graph:
We already knew from the previous post that these data follow this distribution, and the output
reconfirms it. The data points follow the center line and the p-value in the legend is greater than 0.500,
which is greater than any common alpha value. Hence, we can treat these data as following the
3-parameter Weibull distribution.
In the graph, the data values are on the X-axis and the percentiles are on the Y-axis. For this
population, the 50th percentile (the median) corresponds to a body fat percentage of 27.6%. 95% of
the population should fall between the 2.5th and 97.5th percentiles, which correspond to 18.0% and
44.5% body fat. Because of the non-symmetric shape of the distribution, the median (27.6) is closer
to the low value than the high value.
24% body fat corresponds to the 29th percentile. 24% is the top end of the ideal range recommended
by the Web site but it is a fairly low percentile for this population. Said another way, 71% of the
population exceeds the upper limit of the range. Yikes!
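If you don't have the graph handy, the same percentiles can be computed directly from the inverse of the 3-parameter Weibull cumulative distribution function. This minimal Python sketch uses the parameter values from this post and the standard Weibull formulas; the function names are my own.

```python
import math

# Parameter values reported for the body fat data in this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

def weibull_percentile(pct):
    """Value below which pct% of the fitted population falls (inverse CDF)."""
    p = pct / 100.0
    return THRESHOLD + SCALE * (-math.log(1.0 - p)) ** (1.0 / SHAPE)

def weibull_percent_below(x):
    """Percentage of the fitted population at or below x (CDF times 100)."""
    if x <= THRESHOLD:
        return 0.0
    return 100.0 * (1.0 - math.exp(-(((x - THRESHOLD) / SCALE) ** SHAPE)))

for pct in (2.5, 50, 97.5):
    print(f"{pct}th percentile: {weibull_percentile(pct):.1f}% body fat")

below_24 = weibull_percent_below(24)
print(f"24% body fat is at about the {below_24:.0f}th percentile")
print(f"Share of the population above 24% body fat: {100 - below_24:.0f}%")
```

The results round to the same values read off the probability plot: roughly 18.0%, 27.6%, and 44.5% body fat for the three percentiles, with 24% body fat near the 29th percentile.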
Closing Thoughts
For the issue relating to the ideal body fat range, it's fairly clear that something is going on here. I'm
not a health expert, so I don't know the answer. However, it appears that either the range is incorrect
or a large majority (71%) of 14-year-old girls exceed the recommended range. Only 20% actually fall
within the range. However, with a few simple tools in Minitab, we have brought the implications of
these data to life! Just as important, we can easily present these results to others in an easy-to-understand manner.
I hope after reading this you're more comfortable with nonnormal distributions and can see the
advantages of identifying your data's distribution. I’ve shown how you can transcend your raw
sample data and make useful inferences about the larger population that your data represent. You
can safely embrace your nonnormal data!