13-f12-bgunderson-wb-module2 - Open.Michigan

Author: Brenda Gunderson, Ph.D., 2012
License: Unless otherwise noted, this material is made available under the terms of the
Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License:
http://creativecommons.org/licenses/by-nc-sa/3.0/
The University of Michigan Open.Michigan initiative has reviewed this material in accordance
with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it.
The attribution key provides information about how you may share and adapt this material.
Copyright holders of content included in this material should contact
open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of
content.
For more information about how to attribute these materials visit:
http://open.umich.edu/education/about/terms-of-use. Some materials are used with permission
from the copyright holders. You may need to obtain new permission to use those materials for
other uses. This includes all content from:
Mind on Statistics
Utts/Heckard, 4th Edition, Cengage L, 2012
Text Only: ISBN 9781285135984
Bundled version: ISBN 9780538733489
SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary
computer software. Other product names mentioned in this resource are used for identification
purposes only and may be trademarks of their respective companies.
Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy
Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has
dedicated to the public domain.
Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for
copyright.
Public Domain – Ineligible. Works that are ineligible for copyright
protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may
differ.
Content Open.Michigan has used under a Fair Use determination
Fair Use: Use of works that is determined to be Fair consistent with the
U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and
we DO NOT guarantee that your use of the content is Fair. To use this content you should
conduct your own independent analysis to determine whether or not your use will be Fair.
Module 2: Sequence Plots and Q-Q Plots
Objective: In this module, you will add to your set of graphical tools for examining
data. The graphs you will examine include sequence (time) plots for data collected
over time and Q-Q plots for checking whether a normal model is a reasonable
distribution for a quantitative variable.
Overview: Module 1 provided a summary of some graphical and numerical tools
that can be used to summarize the distribution for quantitative and qualitative
variables or responses. We may use those tools for the activities in this module,
but we will also need to utilize the new tools described below. Note that these
graphical tools are introduced solely in lab, not in lecture, so it will benefit you to
read this overview thoroughly.
Sequence (Time) Plots: Data is often gathered over time. Employment rate, stock
prices, and sales figures are just a few examples. When data is gathered over time,
it is generally wise to examine the data plotted against time. Plots against time
can reveal the main features of a time series: overall patterns and striking
deviations from those patterns. Some overall patterns that may arise are:
a. A persistent, long-term rise or fall, called a trend (either increasing
or decreasing).
b. A pattern that repeats itself at regular intervals of time, called
seasonal variation.
c. A persistent, long-term increase or decrease in the variation of the
observations, called a pattern in variation.
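To make these three patterns concrete, the Python sketch below builds one made-up series for each and prints a crude text version of a sequence plot. It is only an illustration, not part of the lab: the function names, the numbers, and the text plot are all assumptions, and in lab the plots are produced in SPSS.

```python
import math

def trend_series(n, start=10.0, slope=0.5):
    """Made-up values that rise steadily over time: a trend."""
    return [start + slope * t for t in range(n)]

def seasonal_series(n, period=12, amplitude=3.0, level=10.0):
    """Made-up values that repeat every `period` time points: seasonal variation."""
    return [level + amplitude * math.sin(2 * math.pi * t / period) for t in range(n)]

def growing_variation_series(n, level=10.0, growth=0.2):
    """Made-up values that swing above and below a fixed level by a widening
    margin: a pattern in variation."""
    return [level + (1 if t % 2 == 0 else -1) * growth * t for t in range(n)]

def text_plot(values, width=40):
    """Crude text sequence plot: one row per time point, '*' positioned by value."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant series
    rows = []
    for t, v in enumerate(values):
        col = round((v - lo) / span * (width - 1))
        rows.append(f"{t + 1:3d} |" + " " * col + "*")
    return "\n".join(rows)

if __name__ == "__main__":
    examples = [
        ("Trend", trend_series(24)),
        ("Seasonal variation", seasonal_series(24)),
        ("Pattern in variation", growing_variation_series(24)),
    ]
    for name, series in examples:
        print(name)
        print(text_plot(series))
        print()
```

Reading each text plot from top to bottom corresponds to reading a sequence plot from left to right.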
If data is collected over time, a sequence plot can be used to check the assumption
of a random sample, which will be needed for inference procedures. As you have
learned in your Chapter 5 lecture notes on Sampling, a random sample consists of
independent and identically distributed (i.i.d.) observations. This means that the
observations can be considered as all coming from the same parent population
(with the same or identical distribution) and are independent of one another. With a
sequence plot, you can check the identically distributed aspect of a random sample
by looking for evidence of stability in the plot. Stability is supported when both
the mean of the observations and the amount of variation among observations
appear to be constant over time and there does not appear to be any pattern in
the resulting plot.
Below are two sequence plots. In the first plot (on the left), the observations support
the claim that the underlying process that generated them is stable. That is
not the case for the observations in the second plot (on the right): there
appears to be an increasing trend, so the underlying process does not appear to
be stable, and the observations should not be considered a random sample.
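As an informal numerical companion to judging stability by eye, the Python sketch below compares the mean and the standard deviation of the first half of a series with those of the second half. This is only an illustration under stated assumptions: the data are made up, the function name compare_halves is hypothetical, and splitting the series in half is one simple choice rather than a formal test. A mean shift near 0 and a standard deviation ratio near 1 are consistent with stability.

```python
from statistics import mean, stdev

def compare_halves(values):
    """Compare the first and second half of a series listed in time order."""
    half = len(values) // 2
    first, second = values[:half], values[half:]
    return {
        # Change in the average level from the first half to the second half
        "mean_shift": mean(second) - mean(first),
        # Spread in the second half relative to the spread in the first half
        "sd_ratio": stdev(second) / stdev(first),
    }

# Made-up observations in the order collected (not from the lab data sets)
stable = [50, 53, 48, 51, 49, 52, 50, 47, 52, 49, 51, 48]
# The same observations with 2 units added per time step, which creates a trend
trending = [v + 2 * t for t, v in enumerate(stable)]

if __name__ == "__main__":
    print("stable:  ", compare_halves(stable))
    print("trending:", compare_halves(trending))
```

For the made-up stable series the mean shifts by only 1 unit between halves, while for the trending series it shifts by 11 units, which matches what an increasing trend looks like in a sequence plot.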
Q-Q Plots: Later in this class, we will see that the assumption of a normal model
for a population of responses will be needed in order to perform certain
inference procedures. Previously, we have seen that a histogram can be used to
get an idea of the shape of a distribution. However, there are more sensitive tools
for checking whether the shape is close to a normal (bell-shaped) model. The best
plot that can be used to check for normality is called a Q-Q Plot, which is a plot of
the percentiles (or quantiles) of a standard normal distribution against the
corresponding percentiles of the observed data. If the observations follow an
approximately normal distribution, the resulting plot should be roughly a straight
line with a positive slope. Deviations from this indicate possible departures from a
normal distribution. Below is an example of a Q-Q Plot showing data that does
seem to come from a population with an approximately normal distribution.
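The construction of a Q-Q plot can be sketched in a few lines of Python. This is a minimal illustration, not the exact method SPSS uses: the plotting positions (i - 0.5)/n are one common convention assumed here, the function names are hypothetical, and both data sets are made up. The correlation of the plotted points is used only as a rough measure of how close they lie to a straight line.

```python
from statistics import NormalDist, mean

def normal_qq_points(data):
    """Pair each sorted observation with a standard normal percentile."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist(0, 1)
    # Assumed plotting positions: (i - 0.5) / n for the i-th smallest value
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def straightness(points):
    """Correlation of the Q-Q points; values near 1 mean roughly a straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: 20 evenly spaced percentiles of a normal model (mean 100, sd 15)
bell_shaped = [NormalDist(100, 15).inv_cdf(p / 21) for p in range(1, 21)]
# Made-up data with a long right tail
right_skewed = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20]

if __name__ == "__main__":
    for name, data in [("bell-shaped", bell_shaped), ("right-skewed", right_skewed)]:
        print(name, round(straightness(normal_qq_points(data)), 3))
```

Plotting the second element of each pair against the first gives the Q-Q plot. The bell-shaped data produce points that are very nearly on a line, while the right-skewed data do not.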
The three graphs below are examples for which a normal model for the response is
not reasonable.
The Q-Q plot on the top left indicates the existence of two clusters of observations.
The Q-Q plot on the top right shows an example where the shape of the
distribution appears to be skewed right. The Q-Q plot on the bottom left shows
evidence of an underlying distribution that has shorter tails compared to those of a
normal distribution.
Note: It is only important that you can see the departures in the above graphs and
not as important to know if the departure implies skewed left versus skewed right
and so on. A histogram would allow you to see the shape and type of departure
from normality.
Finally, we consider an example Q-Q plot that appears normal with the exception
of one data point. In this case, we would say the Q-Q plot shows evidence of an
underlying distribution which is approximately normal except for one large outlier
that should be further investigated. Note that outliers could appear in either the
upper or lower tail.
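One way to see how a single outlier stands apart in a Q-Q plot is to measure how far each point falls from a reference line. The Python sketch below is an illustration under stated assumptions: the reference line is drawn through the first and third quartiles (one common convention, chosen here because quartiles are not pulled around by a single extreme value), the plotting positions are (i - 0.5)/n, the function names are hypothetical, and the data are made up.

```python
from statistics import NormalDist, quantiles

def qq_residuals(data):
    """Return (observation, distance above the reference line) for each point."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist(0, 1)
    z = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    # Reference line through the quartiles of the data and of the standard normal
    q1, _, q3 = quantiles(ordered, n=4)
    z1, z3 = standard_normal.inv_cdf(0.25), standard_normal.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return [(x, x - (intercept + slope * zi)) for zi, x in zip(z, ordered)]

def farthest_point(data):
    """Observation whose Q-Q point lies farthest from the reference line."""
    return max(qq_residuals(data), key=lambda pair: abs(pair[1]))

# Made-up data: 19 values that follow a normal model (mean 50, sd 5) closely,
# plus one large value
sample = [round(NormalDist(50, 5).inv_cdf(p / 20), 1) for p in range(1, 20)] + [90.0]

if __name__ == "__main__":
    value, distance = farthest_point(sample)
    print("Farthest from the line:", value, "by", round(distance, 1))
```

In this made-up sample, the value 90 lies far above the line while every other point stays close to it, which is the pattern described above. A flagged point is a prompt for further investigation, not proof of an error.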
Activity 1: Time-Dependent Data
Background 1: The data set deathrate.sav contains the death rate (number of
deaths per 100 million miles driven) taken at two-year intervals from 1960 to 2004.
Task 1: Display and summarize this data in an appropriate and useful way. What
do you see? Would it make sense to make a histogram of the death rates? The
following steps will guide your thinking as you complete this task.
1. Once the data set is open, swap from Data View to Variable View. Here, you
can set the number of decimal places for each variable, give them longer
labels, and input coding for categorical variables. What is the extended label
description of the variable Rate?
2. Why should a sequence plot be made to display this data?
3. Make a sequence plot for the data using Analyze> Forecasting> Sequence
Charts. What does the graph show? Comment on whether you see any trend,
seasonal variation, or pattern in variation in this graph.
4. Does the plot appear to be stable? What would you conclude if asked if the
data were a random sample of death rates?
5. Would it make sense to make a histogram of the death rates? Why or why
not?
Background 2: The data set oldfaithful.sav contains the date and duration of
eruptions (in minutes) of the Old Faithful geyser. The data was collected several
times per day over 23 consecutive days.
Task 2: Display and summarize the data in an appropriate and useful way. What
do you see? Does there appear to be any pattern to this process? The following
steps will guide your thinking as you complete this task.
1. Make a time plot for the data using Analyze> Forecasting> Sequence Charts.
What does the graph show? Are there any patterns to the process?
2. Does the plot appear to be stable? What would you conclude if asked if the
data were a random sample of eruptions?
Check Your Understanding:
Circle the appropriate word(s) to complete the sentences.
We use sequence (or time) plots to check the
[ independent / identically distributed ] part of the random sample
assumption by looking to see if the data appear to be [ stable / normal ],
that is, have a constant mean and constant variation over time. If there is
any pattern in the observations over time, we [ should / should not ]
make a histogram of the observations for further analysis.
One important type of time-dependent data is seasonal data (data that shows a
pattern that corresponds to seasons of the year). Show below what a sequence
plot of seasonal data would look like. Include your labels.
Activity 2: Checking Normality
Background: You have discussed using a histogram to examine the shape of the
distribution of a quantitative variable. If the histogram shows a fairly
homogeneous set of observations, we might like to assess whether a normal
distribution is a reasonable model for the response. A better graph for assessing
normality is a Q-Q plot. In this problem, we will examine a few distributions and
see what each corresponding Q-Q plot looks like.
Task 1: Suppose a study examined high school students and the relationship
between IQ and GPA. Use the iq.sav dataset and examine the distribution of IQ.
Create a histogram and a Q-Q plot for the IQ values. Q-Q plots are created via
Analyze> Descriptive Statistics> Q-Q Plots. You may provide a rough sketch of the
graphs below.
Histogram:
Q-Q Plot:
1. Describe the shape of the resulting histogram.
2. Is a normal distribution a reasonable model for IQ scores in the population
based on this Q-Q plot?
Task 2: We have previously explored the employee data.sav dataset variable of
salary. Now, let’s check to see if the overall distribution of salary can be
considered normal, and then to see if the distribution of salary might be normal
depending on minority status.
1. Create a histogram for the variable salary, and describe its shape.
2. Create a Q-Q plot for salary using Analyze> Descriptive Statistics> Q-Q Plots.
Based on the evidence of these graphs, is a normal distribution an appropriate
model for current salary? Why or why not?
3. Create histograms and Q-Q plots separately for minorities and non-minorities
(recall the Data> Split File command). Does the distribution of salary appear
to be different for either group? Comment on both the histograms and Q-Q
plots.
4. For salary, is a normal model reasonable for either minorities or
non-minorities?
Check Your Understanding:
Match the corresponding histograms, boxplots, and Q-Q plots.
[Figures: Histograms A, B, and C; Boxplots A, B, and C; Q-Q Plots A, B, and C]
Which type of graph best shows the shape of the underlying distribution?
Which type of graph best shows whether the underlying distribution appears to be
normal (bell-shaped)?
Example Exam Question on Sequence Plots and Q-Q Plots
A new method of measuring phosphorus levels in soil is under consideration. A
sample of 11 soil specimens is analyzed using the new method. The time series
(sequence plot) for the 11 observations is presented below.
[Figure: sequence plot of PHSLEV (vertical axis, ticks from 520 to 660) against
sequence number (horizontal axis, 1 to 11)]
a. Comment on the overall stability of these data based on this plot.
b. An assumption of many statistical inference methods is that the data follow a
normal distribution. In the space provided below, sketch how the Q-Q plot
would appear if a normal distribution were a good model for phosphorus levels.