Module 2.5: Sequence Plots and QQ Plots

Author(s): Brenda Gunderson, Ph.D., 2011
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution–Non-commercial–Share
Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis
or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please
speak to your physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
Some material may be sourced from:
Mind on Statistics
Utts/Heckard, 3rd Edition, Duxbury, 2006
Text Only: ISBN 0495667161
Bundled version: ISBN 1111978301
Material from this publication used with permission.
Attribution Key
for more information see: http://open.umich.edu/wiki/AttributionPolicy
Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC §
105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License
Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in
your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your
jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that
your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Module 2: Sequence Plots and Q-Q Plots
Objective: In this module, you will add to your set of graphical tools for examining data. The graphs you
will examine include sequence (time) plots for data collected over time and Q-Q plots for checking
whether a normal model is a reasonable distribution for a quantitative variable.
Overview: Module 1 provided a summary of some graphical and numerical tools that can be used to
summarize the distribution for quantitative and qualitative variables or responses. We may use those
tools for the activities in this module, but we will also need to utilize the new tools described below.
Note that these graphical tools are introduced solely in lab, not in lecture, so it will benefit you to read
this overview thoroughly.
Sequence (Time) Plots: Data is often gathered over time. Employment rates, stock prices, and sales
figures are just a few examples. When data is gathered over time, it is generally wise to examine the data
plotted against time. Plots against time can reveal the main features of a time series: overall patterns and
striking deviations from those patterns. Some overall patterns that may arise are:
- A persistent, long-term rise or fall called a trend (either increasing or decreasing).
- A pattern that repeats itself at regular intervals of time called seasonal variation.
- A persistent, long-term increase (or decrease) in the variation of the observations. This is called
  a pattern in variation.
If data is collected over time, a sequence plot can be used to check the assumption of a random sample,
which will be needed for inference procedures. As you have learned, random samples consist of
independent and identically distributed (i.i.d.) observations. This means that the observations all come
from the same parent population and are independent of one another. With a sequence plot, you can
check the identically distributed aspect of a random sample by looking for evidence of stability in the
plot. Stability is supported when both the mean of the observations and the amount of variation among
observations appear to be constant over time. Below are two example sequence plots; the first appears
to be stable while the second does not appear stable.
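The idea of stability can also be sketched numerically. The following Python example is only an illustration and is not part of the SPSS lab: it simulates one stable series and one series with a trend (both made up, with assumed means and spreads), then compares the mean and standard deviation of the first and second halves of each. The helper name `halves_summary` is hypothetical, and comparing halves is a crude stand-in for looking at the plot itself.

```python
import random
from statistics import mean, stdev


def halves_summary(values):
    """Return ((mean, sd) of the first half, (mean, sd) of the second half)."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (mean(first), stdev(first)), (mean(second), stdev(second))


rng = random.Random(0)  # fixed seed so the simulated series are reproducible
n = 200

# Simulated stable series: constant mean (50) and constant spread (5).
stable = [rng.gauss(50, 5) for _ in range(n)]

# Simulated unstable series: the same kind of noise plus an upward trend.
trending = [50 + 0.2 * t + rng.gauss(0, 5) for t in range(n)]

stable_first, stable_second = halves_summary(stable)
trend_first, trend_second = halves_summary(trending)

print("stable:   first half mean/sd = %.1f/%.1f, second half = %.1f/%.1f"
      % (stable_first + stable_second))
print("trending: first half mean/sd = %.1f/%.1f, second half = %.1f/%.1f"
      % (trend_first + trend_second))
```

For the stable series the two halves should have similar means and standard deviations, while for the trending series the second-half mean should be clearly larger. A pattern in variation would show up instead as a change in the standard deviation.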
Q-Q Plots: Later in this class we will see that the assumption of a
normal model for a population of responses will be needed in order to
perform certain inference procedures. Previously, we have seen that a
histogram can be used to get an idea of the shape of a distribution.
However, there are more sensitive tools for checking if the shape is
close to a normal (bell-shaped) model. A good plot that can be used to
check for normality is called a Q-Q Plot, which is a plot of the
percentiles (or quantiles) of a standard normal distribution against the
corresponding percentiles of the observed data. If the observations
follow approximately a normal distribution, the resulting plot should be
roughly a straight line with a positive slope. Deviations from this would
indicate possible departures from a normal distribution. At the right is
an example of a Q-Q Plot showing data that does seem to come from a
population with an approximately normal distribution.
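To make the construction concrete, here is a minimal Python sketch of how the points of a Q-Q plot could be computed by hand. It is an illustration under stated assumptions, not a description of what SPSS does internally: the function name `qq_points`, the five sample values, and the plotting position (i - 0.5)/n are all choices made for this example, and statistical packages use a variety of plotting-position conventions.

```python
from statistics import NormalDist


def qq_points(data):
    """Pair each sorted observation with a standard normal quantile."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(sorted(data), start=1):
        p = (i - 0.5) / n  # assumed plotting position for the i-th smallest value
        points.append((value, std_normal.inv_cdf(p)))
    return points


# Hypothetical sample, used only to show the mechanics.
sample = [5.6, 4.1, 7.3, 5.0, 6.2]
points = qq_points(sample)

for observed, theoretical in points:
    print(f"observed = {observed:4.1f}   standard normal quantile = {theoretical:6.3f}")
```

Plotting these pairs gives the Q-Q plot. Because both the sorted observations and the normal quantiles increase from one point to the next, the points always rise from left to right; what matters for normality is whether they rise along a roughly straight line.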
The three graphs below are examples for which a normal model for the response is not reasonable.
The Q-Q plot above left indicates the existence of two clusters of observations. The Q-Q plot above
center shows an example where the shape of the distribution appears to be skewed right. The Q-Q plot
above right shows evidence of an underlying distribution that has shorter tails compared to those of a
normal distribution.
Finally, we consider an example Q-Q plot that appears normal with the exception of one data point. In
this case, we would say the Q-Q plot shows evidence of an underlying distribution which is
approximately normal except for one large outlier that should be further investigated. Note that outliers
could appear in either the upper or lower tail.
Note: It is most important that you can see the
departures in the above graphs and not as important to
know if the departure implies skewed left versus
skewed right and so on. A histogram would allow you
to see the shape and type of departure from normality.
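As a rough numerical companion to judging straightness by eye, the sketch below computes the correlation between the sorted observations and the corresponding standard normal quantiles for two simulated samples. This is an illustration only and is not a method used in this lab: the function name `qq_correlation`, the simulated samples, and the (i - 0.5)/n plotting position are assumptions made for the example.

```python
import random
from statistics import NormalDist, mean


def qq_correlation(data):
    """Pearson correlation between sorted data and standard normal quantiles."""
    n = len(data)
    observed = sorted(data)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = mean(observed), mean(theoretical)
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, theoretical))
    sxx = sum((x - mx) ** 2 for x in observed)
    syy = sum((y - my) ** 2 for y in theoretical)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the simulated samples are reproducible

# Simulated sample from a normal model (assumed mean 100, sd 15).
normal_sample = [rng.gauss(100, 15) for _ in range(200)]

# Simulated right-skewed sample (exponential distribution).
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample:       r = {r_normal:.3f}")
print(f"right-skewed sample: r = {r_skewed:.3f}")
```

A correlation very close to 1 corresponds to points that fall near a straight line. The right-skewed sample should give a noticeably lower value, although the number alone does not say what kind of departure is present; the plot and a histogram are still needed for that.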
Activity 1: Time-dependent Data
Background 1: Below is the death rate (number of deaths per 100 million miles driven) for every other
year from 1960 through 2004.
Year   Rate      Year   Rate      Year   Rate
1960   5.1       1978   3.3       1996   1.8
1962   5.1       1980   3.3       1998   1.7
1964   5.4       1982   2.8       2000   1.5
1966   5.5       1984   2.6       2002   1.5
1968   5.2       1986   2.5       2004   1.6
1970   4.7       1988   2.4
1972   4.2       1990   2.5
1974   3.5       1992   2.3
1976   3.3       1994   1.8
Task 1: Display and summarize this data in an appropriate and useful way. What do you see? Would it
make sense to make a histogram of the death rates? The following steps will guide your thinking as you
complete this task.
1. Input this data in SPSS. To enter data in SPSS, open a new file. To set up variables, switch from Data
View to Variable View. In this view, you can set the number of decimal places for each variable, give
the variables longer labels, and input coding for categorical variables. Return to the Data View when you
are ready to enter the data. Entering data works much as it does in an Excel spreadsheet.
2. Why should a sequence plot be made to display this data in a useful way?
3. Make a sequence plot for the data using Analyze>Forecasting>Sequence Charts.
What does the graph show? Comment on whether you see any trend, seasonal variation, or pattern in
variation in this graph.
4. Does the plot appear to be stable? What would you conclude if asked whether the data were a random
sample of death rates?
5. Would it make sense to make a histogram of the death rates? Why or why not?
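For readers working outside SPSS, the following Python sketch enters the death rate data from Background 1 and prints a rough text-based sequence plot, one row per year. It is only an illustration of plotting the values in time order; the variable names and the bar scaling are choices made for this example, and the SPSS sequence chart remains the intended tool for the task.

```python
# Death rates (deaths per 100 million miles driven) from Background 1.
years = list(range(1960, 2005, 2))  # 1960, 1962, ..., 2004
rates = [5.1, 5.1, 5.4, 5.5, 5.2, 4.7, 4.2, 3.5, 3.3, 3.3, 3.3, 2.8,
         2.6, 2.5, 2.4, 2.5, 2.3, 1.8, 1.8, 1.7, 1.5, 1.5, 1.6]

# Text-based sequence plot: bar length is proportional to the rate.
for year, rate in zip(years, rates):
    print(f"{year}  {rate:3.1f}  " + "*" * round(rate * 10))

# Count how the rate moved from one listed year to the next.
changes = [later - earlier for earlier, later in zip(rates, rates[1:])]
decreases = sum(1 for c in changes if c < 0)
increases = sum(1 for c in changes if c > 0)
unchanged = sum(1 for c in changes if c == 0)
print(f"decreases: {decreases}, increases: {increases}, unchanged: {unchanged}")
```

Reading the bars from top to bottom plays the same role as reading a sequence chart from left to right, and the counts of increases and decreases give one simple way to back up what you see.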
Background 2: The data set oldfaithful.sav contains the date and duration of eruptions (in minutes) of
the Old Faithful geyser. The data was collected several times per day over 23 consecutive days.
Task 2: Display and summarize the data in an appropriate and useful way. What do you see? Does there
appear to be any pattern to this process? The following steps will guide your thinking as you complete
this task.
1. Make a time plot for the data using Analyze> Time Series> Sequence Charts. What does the graph
show? Are there any patterns to the process?
2. Does the plot appear to be stable? What would you conclude if asked whether the data were a random
sample?
Check Your Understanding:
Circle the appropriate word(s) to complete the sentences.
We use sequence (or time) plots to check the [ independent / identically distributed ] part of the
random sample assumption by looking to see if the data appear to be [ stable / normal ], that is, have a
constant mean and constant variation over time.
If there is any pattern in the observations over time, we [ should / should not ] make a histogram of the
observations for further analysis.
One important type of time-dependent data is seasonal data (data that shows a pattern that
corresponds to seasons of the year). Show below what a sequence plot of seasonal data would look
like. Include your labels.
Activity 2: Checking Normality
Background 1: You have discussed using a histogram to examine the shape of the distribution of a
quantitative variable. If the histogram shows a fairly homogeneous set of observations, we might like to
assess if a particular distribution, called the normal distribution, is a reasonable model for the response.
A better graph for assessing normality is a Q-Q plot. In this problem we will examine a few distributions
and see what each corresponding Q-Q plot looks like.
Task 1: Suppose a study examined high school students and the relationship between IQ and GPA.
Use the iq.sav dataset and examine the distribution of IQ.
Create a histogram and a Q-Q plot for the IQ values. Q-Q plots are created via Analyze>Descriptive
Statistics> Q-Q plots. You may sketch the graphs below.
1. Describe the shape of the resulting histogram.
2. When a bell-shaped, normal distribution is reasonable, the points on a Q-Q plot will approximately
follow a straight line with positive slope. This line shows where points should fall if the sample
quantiles are equal to the theoretical quantiles. Is a normal distribution a reasonable model for IQ
scores in the population based on this Q-Q plot?
Task 2: We have previously explored the SALARY variable in the Employee data.sav dataset. Now, let’s
check whether the overall distribution of SALARY can be considered normal, and then whether the
distribution of SALARY might be normal within each sex.
1. Use the Employee data.sav dataset and create the histogram for the variable SALARY (current
salary).
2. Describe the shape of the resulting histogram.
3. Create a Q-Q plot for SALARY using Analyze>Descriptive Statistics>Q-Q Plots. Based on the evidence
of these graphs, is a normal distribution an appropriate model for current salary? Why or why not?
4. Create histograms and Q-Q plots separately for each sex (recall the Data>Split File command). Does
the distribution of SALARY appear to differ between the sexes? Comment on both the histograms
and Q-Q plots.
5. For salary, is a normal model reasonable for either sex?
Check Your Understanding:
Assume that a normal model is NOT reasonable for the distribution of
the waiting time at a bank. Using the best plot available for assessing
normality, draw an example plot that would indicate this.
Hint: For Q-Q plots, due to how they are constructed, points must be
increasing along the y-axis as you move right across the x-axis.
Notes about Saving SPSS data sets
(1) When you end your SPSS session, if you have not saved a data set that you entered or changed, a
dialog box will ask you "Do you want to save changes?" Click on OK if you want to save the data set
at this point. A dialog box will appear asking you to name the data file, then click on Save. Click on
Don't Save if you don't want to save the data. Also note that you cannot save anything to a
"locked" disk.
(2) When you save a data file using the Save as... command, the dialog box that appears allows you to
designate where the file is to be saved. Click on the box at the top with the arrow in it until it shows
the name of the place where you want to save your file (usually on your own disk).
Example Exam Question on Sequence Plots and Q-Q Plots
A new method of measuring phosphorus levels in soil is under consideration. A sample of 11 soil
specimens is analyzed using the new method. The time series (sequence plot) for the 11 observations is
presented below.
a. Comment on the overall stability of these data based on this plot.
[Figure: sequence plot of PHSLEV (vertical axis, tick marks from 520 to 660) against sequence number
(horizontal axis, 1 to 11); plot not reproduced here.]
b. An assumption of many statistical inference methods is that the data follow a normal distribution. In
the space provided below, sketch how the Q-Q plot (quantile plot) would appear if a normal
distribution was a good model for phosphorus levels.