qq-plot-75-925 - Quality and Innovation

Normal Probability Plot (QQ Plot)
Objective
To help you get a quick sense of whether or not the data you have collected are
normally distributed! This is significant because if you have collected some data that are
nearly normal (e.g. test scores), you can use the normal model and z-scores to answer
questions like "What percentage of students performed better/worse than me?"
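As a minimal sketch of that kind of question, the R code below converts a score to a z-score and uses pnorm to look up the percentage of students below it. The mean and standard deviation are the ones used for the simulated N(46.85,11.31) data later in this section; my.score is a made-up value, purely for illustration.

```r
# Assumed values: mean and sd match the N(46.85, 11.31) example used later;
# my.score is hypothetical.
class.mean <- 46.85
class.sd   <- 11.31
my.score   <- 60

# z-score: how many standard deviations my score is from the mean
z <- (my.score - class.mean) / class.sd

# Under the normal model, pnorm(z) is the proportion of scores below mine
pct.below <- 100 * pnorm(z)
pct.above <- 100 - pct.below

cat(sprintf("z = %.2f; about %.0f%% scored lower, %.0f%% scored higher\n",
            z, pct.below, pct.above))
```

This only gives a sensible answer if the scores really are close to normal, which is exactly what the QQ plot helps you check.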
Background
The normal probability plot, also called a QQ Plot (short for quantile-quantile plot), gives
you a quick visual diagnostic that reveals whether the data are distributed normally - or
not. This plot compares each point in your data set to where that point would be in an
idealized, perfectly normal distribution with the same mean and standard deviation as
your data set (these expected positions are called the "theoretical quantiles"). So on
this plot, we're looking for a general pattern of linearity.
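To make the comparison concrete, here is a rough sketch of how a QQ plot can be built by hand in R. The data are simulated (they are not the real test scores), and the variable names are just for illustration. Note that R's qqnorm uses quantiles of the standard normal distribution on the horizontal axis, so the mean and standard deviation of your data only change the position and slope of the line, not the linearity of the pattern.

```r
# Hypothetical data: 50 simulated scores (not the CAOS-Fall12.txt data)
set.seed(1)
y <- rnorm(50, mean = 46.85, sd = 11.31)
n <- length(y)

# Plotting positions: evenly spread probabilities strictly between 0 and 1
p <- ppoints(n)

# Theoretical quantiles: where each ordered point "should" fall if normal
theoretical <- qnorm(p)

# Sample quantiles: the observed data, sorted from smallest to largest
observed <- sort(y)

# A straight-line pattern suggests the data are close to normal
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "QQ plot built by hand")
```

The result should match what qqnorm(y) draws, since qqnorm does the same sorting and quantile lookup for you.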






If all of the data points lie along the line or close to it, then each of the points
in your sample are close to where they would be in a perfectly normal
distribution - suggesting that your distribution is nearly normal!
If the distribution is skewed to the right (that is, if the tail of the distribution
stretches to the right), the data points will track linearly above the line on the
left hand side of the QQ plot, and then will veer sharply upward on the right
hand side of the plot. (The plot looks sort of like a backwards C.)
If the distribution is skewed to the left (that is, if the tail of the distribution
stretches to the left), the data points will drop sharply below the line on the left hand
side of the QQ plot, and then will flatten out and fall below the line again on the right
hand side of the plot. (The plot looks sort of like an upside down C.)
If the data are bimodal, you'll see an "S" pattern.
When points fall above the line, it means "the values over here are LARGER than
we would expect from a normal distribution... for example, high outliers"
When points fall below the line, it means "the values in this part of the
distribution are SMALLER than we would expect... for example, low outliers"
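If you'd like to see these patterns for yourself, one option is to simulate data with a known shape and plot it. The sketch below uses made-up samples: an exponential distribution for right skew, its mirror image for left skew, and a mix of two normal distributions (with peaks near 40 and 65) for the bimodal case. The sample sizes and parameters are arbitrary choices.

```r
set.seed(42)  # fixed seed so the simulated samples are reproducible

# Simulated (hypothetical) samples, 200 values each
right.skewed <- rexp(200, rate = 1)               # long tail to the right
left.skewed  <- -rexp(200, rate = 1)              # mirror image: long tail to the left
bimodal      <- c(rnorm(100, mean = 40, sd = 5),
                  rnorm(100, mean = 65, sd = 5))  # two separate peaks

# Draw the three QQ plots side by side
old.par <- par(mfrow = c(1, 3))
qqnorm(right.skewed, main = "Skewed right"); qqline(right.skewed)
qqnorm(left.skewed,  main = "Skewed left");  qqline(left.skewed)
qqnorm(bimodal,      main = "Bimodal");      qqline(bimodal)
par(old.par)
```

Try changing the seed or the sample size: with small samples the patterns get noisier, which is a good reminder not to over-read any single plot.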
Let's say I've administered an exam, and I have a collection of test scores to examine.
They are stored in a text file called CAOS-Fall12.txt with one score per row (that's
all). I want to check and see if the distribution is normal. When I look at the distribution
of scores, it looks kind of normal, but it's skewed a little bit to the right, and so I'd like to
see whether it really is normal. I've loaded my data into R using the read.table
command, which creates a data frame, and hist to produce the histogram:
> test.scores <- read.table("CAOS-Fall12.txt",header=F)
> hist(test.scores$V1,breaks=10,xlim=c(0,100),ylim=c(0,10),col="gray")
Figure X.1: Histogram produced from test scores
Data Format
You can create a QQ plot from a variable within a data frame, OR just from data that
you enter directly into a vector. So any of these will work:
yourdata <- read.table("your-data-file.txt",header=F)
yourdata <- read.csv("your-data-file.csv",header=F)
yourdata <- c(1,2,3,4,5,6,7,8,9)
The top one assumes that your data set is in a text file and the top row does not contain
variable names (if it does, change header to equal T for true). The second one assumes
that your data set is in a CSV file and the top row does not contain variable names (if it
does, change header to equal T for true). The third one does not use a data file, but
instead just assumes that you'll replace the 1,2,3 and so forth with the data you actually
collected yourself.
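One detail worth spelling out: read.table and read.csv give you a data frame, while the plotting commands want the column of numbers inside it. Here is a small sketch using a temporary file with made-up scores (the file and its contents are invented so the example can run on its own).

```r
# Write a tiny made-up data file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("55", "42", "61", "38", "47"), tmp)

# Read it the same way as above: no header row, so R names the column V1
yourdata <- read.table(tmp, header = FALSE)

str(yourdata)            # a data frame with one column, V1
scores <- yourdata$V1    # pull the column out as a plain numeric vector

# qqnorm() and qqline() expect a numeric vector, not a whole data frame
qqnorm(scores)
qqline(scores)
```

If you typed your data in with c(...), you already have a vector and can skip the $V1 step.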
Code and Results
Now that you know all the background, actually getting the normal probability plot to
display is super easy using R. If you've already loaded up your data into a numeric vector
called mydata, just run these two commands:
qqnorm(mydata)
qqline(mydata)
Here's how I produced a QQ plot using the test scores data above. The $V1 notation
indicates that I'm using the first variable in my data set. R gave it that name because I
didn't have a header (or top row) in my data file stating the names of my variables. The
results from executing both of these commands are shown in Figure X.2 - qqnorm
creates the plot and puts the data on it, while qqline superimposes the line.
> qqnorm(test.scores$V1)
> qqline(test.scores$V1)
Figure X.2: A normal probability plot (or QQ plot) for my test scores.
The normal probability plot in Figure X.2 has 50 data points on it, one for each test score
in the data set. Each point pairs an actual test score (on the vertical axis) with the
quantile we would expect at that rank if the scores came from a perfectly normal
distribution (on the horizontal axis).
The question we have to ask when we're looking at the QQ Plot is this: "is there a
general pattern of linearity evident in this plot?" I'd say yes, because so many of the
data points are close to the diagonal line. Because the points on both the left hand side
and the right hand side of the QQ Plot are above the diagonal line, this might suggest
that the data are slightly skewed to the right (which we can see in the histogram).
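If you want to check "above or below the line" with numbers instead of by eye, you can rebuild the line yourself. By default, qqline draws the line through the first and third quartiles of the data. The sketch below uses simulated scores as a stand-in for the real ones, and the variable names are my own.

```r
# Hypothetical data standing in for the test scores
set.seed(7)
scores <- rnorm(50, mean = 46.85, sd = 11.31)

# qqline() by default draws the line through the first and third quartiles
q.sample  <- quantile(scores, c(0.25, 0.75), names = FALSE)
q.theory  <- qnorm(c(0.25, 0.75))
slope     <- diff(q.sample) / diff(q.theory)
intercept <- q.sample[1] - slope * q.theory[1]

# Coordinates of the plotted points (without drawing anything)
qq <- qqnorm(scores, plot.it = FALSE)

# Vertical distance of each point from the line: positive means above
resid <- qq$y - (intercept + slope * qq$x)

# Look at the lowest five and highest five scores
ord <- order(qq$x)
round(resid[head(ord, 5)], 2)   # left hand side of the plot
round(resid[tail(ord, 5)], 2)   # right hand side of the plot
```

Mostly positive values at both ends would be consistent with a skew to the right. For truly normal data you should see a mix of small positive and negative values.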
You can also test your distribution against the null hypothesis that the data are
distributed normally, using the Shapiro-Wilk test. (This means that the alternative
hypothesis would be that the data are not distributed normally.) It is common to reject
the null hypothesis that the data are normal when the p-value is less than 0.05. Here the
p-value is 0.06088, so we can't quite say that our test score distribution is not normal.
But it's right on the edge.
> shapiro.test(test.scores$V1)
Shapiro-Wilk normality test
data: test.scores$V1
W = 0.9561, p-value = 0.06088
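If you'd rather have R make the comparison than read the p-value off the screen, you can store the test result and pull the pieces out of it. This is a small sketch on simulated scores (not the real exam data), using the conventional 0.05 significance level as the cutoff.

```r
# Hypothetical data: 50 simulated scores, not the real exam data
set.seed(123)
scores <- rnorm(50, mean = 46.85, sd = 11.31)

result <- shapiro.test(scores)
alpha  <- 0.05                 # conventional significance level

result$statistic               # the W statistic
result$p.value                 # the p-value as a plain number

# Compare the p-value to alpha to reach a decision
if (result$p.value < alpha) {
  decision <- "reject normality"
} else {
  decision <- "cannot reject normality"
}
decision
```

With the real test scores, 0.06088 is greater than 0.05, so the same comparison would land on "cannot reject normality".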
In contrast, here's what a "really linear" QQ plot looks like (Figure X.3). I created this
display using the code below. The rnorm command draws random values from a normal
distribution (here, 500 values with mean 46.85 and standard deviation 11.31).
> normal <- rnorm(500,46.85,11.31)
> par(mfrow=c(1,2))
> hist(normal,breaks=12,xlim=c(0,100),ylim=c(0,100),
main="N(46.85,11.31)",col="gray")
> qqnorm(normal);qqline(normal)
Figure X.3: A normal probability (QQ) plot for very normal data!
Figure X.4: Comparing normal probability (QQ) plots for unimodal and bimodal data.
Just to give you a sense of what a normal probability plot might look like for data that
are clearly not normally distributed, see Figure X.4. The top row shows a histogram and
a QQ plot for the original test score data. The bottom row shows test scores from a class
in a previous year, where the shape of the distribution was more bimodal, with peaks
around 40 and 65. The QQ plot has the "S" shape indicating a bimodal distribution.
Conclusions
The normal probability (or QQ) plot provides a quick way to get a feel for whether data
are distributed normally. (However, this method is not conclusive! If you need a formal
test of whether your data are normally distributed or not, try the Shapiro-Wilk
test or the Anderson-Darling test.)
Other Resources:



http://en.wikipedia.org/wiki/Normal_probability_plot
http://www.pmean.com/09/NormalPlot.html -- this site provides many
examples of patterns you could see in normal probability plots, what they
indicate, and what you should be cautious about
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/shapiro.test.html - R
documentation for the Shapiro-Wilk test of normality