Normal Probability Plot (QQ Plot)

Objective

To help you get a quick sense of whether or not the data you have collected are normally distributed! This is significant because if you have collected some data that are nearly normal (e.g. test scores), you can use the normal model and z-scores to answer questions like "What percentage of students performed better/worse than me?"

Background

The normal probability plot, also called a QQ plot (short for quantile-quantile plot), gives you a quick visual diagnostic that reveals whether the data are distributed normally - or not. This plot compares each point in your data set to where that point would be in an idealized, perfectly normal distribution (these idealized positions are called the "theoretical quantiles"; R plots them in standard deviation units, i.e. as z-scores). So on this plot, we're looking for a general pattern of linearity.

If all of the data points lie along the line or close to it, then each of the points in your sample is close to where it would be in a perfectly normal distribution - suggesting that your distribution is nearly normal!

If the distribution is skewed to the right (that is, if the tail of the distribution stretches to the right), the data points will sit above the line on the left hand side of the QQ plot, and then will veer sharply upward, away from the line, on the right hand side of the plot. (The points bow upward, sort of like the right-hand half of a U.)

If the distribution is skewed to the left (that is, if the tail of the distribution stretches to the left), the data points will drop sharply below the line on the left hand side of the QQ plot, and then will flatten out below the line on the right hand side of the plot. (The points bow downward, sort of like the left-hand half of an upside down U.)

If the data are bimodal, you'll see an "S" pattern.

When points fall above the line, it means "there are MORE data elements over here in this region than we would expect...
for example, outliers." When points fall below the line, it means "there are FEWER data elements in this part of the distribution than we would expect."

Let's say I've administered an exam, and I have a collection of test scores to examine. They are stored in a text file called CAOS-Fall12.txt with one score per row (that's all). When I look at the distribution of scores, it looks kind of normal, but it's skewed a little bit to the right, so I'd like to check whether it really is normal. I've loaded my data into R using the read.table command, which creates a data frame, and used hist to produce the histogram:

> test.scores <- read.table("CAOS-Fall12.txt", header=FALSE)   # one score per row, no header row
> hist(test.scores$V1, breaks=10, xlim=c(0,100), ylim=c(0,10), col="gray")

Figure X.1: Histogram produced from test scores

Data Format

You can create a QQ plot from a variable within a data frame, OR just from data that you enter directly into a vector. So any of these will work:

yourdata <- read.table("your-data-file.txt", header=FALSE)
yourdata <- read.csv("your-data-file.csv", header=FALSE)
yourdata <- c(1,2,3,4,5,6,7,8,9)

The top one assumes that your data set is in a text file and the top row does not contain variable names (if it does, change header to TRUE). The second one assumes that your data set is in a CSV file and the top row does not contain variable names (if it does, change header to TRUE). The third one does not use a data file, but instead just assumes that you'll replace the 1,2,3 and so forth with the data you actually collected yourself. With the first two, your values end up in a column of a data frame (named V1 if there is no header row), so you'll refer to them as yourdata$V1; with the third, yourdata is the vector of values itself.

Code and Results

Now that you know all the background, actually getting the normal probability plot to display is super easy using R. If you've already loaded up your data and called it mydata (where mydata is a numeric vector, such as one column of a data frame), these two commands are all you need:

qqnorm(mydata)
qqline(mydata)

Here's how I produced a QQ plot using the test scores data above. The $V1 notation indicates that I'm using the first variable in my data set.
R gave it that name because I didn't have a header (or top row) in my data file stating the names of my variables. The results from executing both of these commands are shown in Figure X.2 - qqnorm creates the plot and puts the data on it, while qqline superimposes the line (by default, a line through the first and third quartiles of the data).

> qqnorm(test.scores$V1)   # plot sample quantiles against theoretical normal quantiles
> qqline(test.scores$V1)   # add the reference line

Figure X.2: A normal probability plot (or QQ plot) for my test scores.

The normal probability plot in Figure X.2 has 50 data points on it, one for each test score in the data set. Each point pairs an actual test score (on the vertical axis) with the position that score would occupy in a perfectly normal distribution (on the horizontal axis), so a point's distance from the line shows how far that score departs from the normal model. The question we have to ask when we're looking at the QQ plot is this: "is there a general pattern of linearity evident in this plot?" I'd say yes, because so many of the data points are close to the diagonal line. Because the points on both the left hand side and the right hand side of the QQ plot are above the diagonal line, this might suggest that the data are slightly skewed to the right (which we can see in the histogram).

You can also test your distribution against the null hypothesis that the data are distributed normally, using the Shapiro-Wilk test. (This means that the alternative hypothesis would be that the data are not distributed normally.) It is common to reject the null hypothesis that the data are normal when the p-value is less than 0.05. In this case the p-value is 0.06088, so we can't quite say that our test score distribution is not normal. But it's right on the edge.

> shapiro.test(test.scores$V1)

Shapiro-Wilk normality test
data: test.scores$V1
W = 0.9561, p-value = 0.06088

In contrast, here's what a "really linear" QQ plot looks like (Figure X.3). I created this display using the code below. The rnorm command draws random values from a normal distribution with the mean and standard deviation you specify.
> normal <- rnorm(500, mean=46.85, sd=11.31)   # 500 random draws from a normal distribution
> par(mfrow=c(1,2))                            # two plots side by side
> hist(normal, breaks=12, xlim=c(0,100), ylim=c(0,100), main="N(46.85,11.31)", col="gray")
> qqnorm(normal)
> qqline(normal)

Figure X.3: A normal probability (QQ) plot for very normal data!

Figure X.4: Comparing normal probability (QQ) plots for unimodal and bimodal data.

Just to give you a sense of what a normal probability plot might look like for data that are clearly not normally distributed, see Figure X.4. The top row shows a histogram and a QQ plot for the original test score data. The bottom row shows test scores from a class in a previous year, where the shape of the distribution was more bimodal, with peaks around 40 and 65. The QQ plot has the "S" shape indicating a bimodal distribution.

Conclusions

The normal probability (or QQ) plot provides a quick way to get a feel for whether data are distributed normally. (However, this method is not conclusive! If you need a more formal check of whether your data are normally distributed or not, try the Shapiro-Wilk test or the Anderson-Darling test. Keep in mind that these tests can only fail to reject normality; they can't prove it.)

Other Resources:

http://en.wikipedia.org/wiki/Normal_probability_plot
http://www.pmean.com/09/NormalPlot.html - this site provides many examples of patterns you could see in normal probability plots, what they indicate, and what you should be cautious about
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/shapiro.test.html - R documentation for the Shapiro-Wilk test of normality
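If you'd like to see the patterns from the Background section for yourself, here is a minimal sketch that draws QQ plots for four simulated data sets: one normal, one skewed right, one skewed left, and one bimodal. Everything in it is an assumption made for illustration: the data are simulated (they are not the test scores), the seed, sample size, and distribution parameters are arbitrary, and the names samples and linearity are mine. The last step, which computes the correlation between the sample and theoretical quantiles, is just an informal way to put a number on "how linear does this look" and is not part of the workflow above.

```r
# Simulated data only - none of these values come from the test score file.
set.seed(42)   # arbitrary seed so the sketch is reproducible
n <- 500       # arbitrary sample size

samples <- list(
  "Normal"       = rnorm(n, mean = 50, sd = 10),
  "Right-skewed" = rexp(n, rate = 1 / 10),         # long tail to the right
  "Left-skewed"  = 100 - rexp(n, rate = 1 / 10),   # mirror image: long tail to the left
  "Bimodal"      = c(rnorm(n / 2, mean = 40, sd = 5),
                     rnorm(n / 2, mean = 65, sd = 5))
)

# Draw the four QQ plots in a 2-by-2 grid, each with its reference line
old.par <- par(mfrow = c(2, 2))
for (name in names(samples)) {
  qqnorm(samples[[name]], main = name)
  qqline(samples[[name]])
}
par(old.par)   # restore the previous plot layout

# Informal linearity summary: correlation between theoretical and sample quantiles.
# qqnorm(..., plot.it = FALSE) returns the plotted coordinates without drawing anything.
linearity <- sapply(samples, function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
})
print(round(linearity, 3))
```

The closer a value in linearity is to 1, the more closely the points follow a straight line. Try changing n to 50 (the size of my test score data set) and rerunning a few times without the set.seed line to get a feel for how much a QQ plot can wobble even when the data really are normal.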