Author: Brenda Gunderson, Ph.D., 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License: http://creativecommons.org/licenses/by-nc-sa/3.0/

The University of Michigan Open.Michigan initiative has reviewed this material in accordance with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it. The attribution key provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to attribute these materials visit: http://open.umich.edu/education/about/terms-of-use.

Some materials are used with permission from the copyright holders. You may need to obtain new permission to use those materials for other uses. This includes all content from: Mind on Statistics, Utts/Heckard, 4th Edition, Cengage L, 2012. Text Only: ISBN 9781285135984. Bundled version: ISBN 9780538733489.

SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary computer software. Other product names mentioned in this resource are used for identification purposes only and may be trademarks of their respective companies.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Make Your Own Assessment
Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright.
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may differ.
Content Open.Michigan has used under a Fair Use determination.
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should conduct your own independent analysis to determine whether or not your use will be Fair.

Module 2: Sequence Plots and Q-Q Plots

Objective: In this module, you will add to your set of graphical tools for examining data. The graphs you will examine include sequence (time) plots for data collected over time and Q-Q plots for checking whether a normal model is a reasonable distribution for a quantitative variable.

Overview: Module 1 provided a summary of some graphical and numerical tools that can be used to summarize the distribution of quantitative and qualitative variables or responses. We may use those tools for the activities in this module, but we will also need the new tools described below. Note that these graphical tools are introduced solely in lab, not in lecture, so it will benefit you to read this overview thoroughly.

Sequence (Time) Plots: Data is often gathered over time. Employment rates, stock prices, and sales figures are just a few examples. When data is gathered over time, it is generally wise to examine the data plotted against time. Plots against time can reveal the main features of a time series: overall patterns and striking deviations from those patterns. Some overall patterns that may arise are:
a. A persistent, long-term rise or fall called a trend (either increasing or decreasing).
b. A pattern that repeats itself at regular intervals of time called seasonal variation.
c. A persistent, long-term increase or decrease in the variation of the observations called a pattern in variation.
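The three patterns listed above can be illustrated with a short Python sketch. This is not part of the SPSS lab: the series are made up, and the series length, noise levels, seed, and the helper names `least_squares_slope` and `half_standard_deviations` are all assumptions chosen for illustration. The slope of a fitted line is used as a rough numerical hint of a trend, and the standard deviations of the two halves of a series as a rough hint of a pattern in variation.

```python
import math
import random
import statistics


def least_squares_slope(values):
    """Slope of the least-squares line of the values against time 1, 2, ..., n."""
    times = list(range(1, len(values) + 1))
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


def half_standard_deviations(values):
    """Sample standard deviations of the first and second halves of the series."""
    half = len(values) // 2
    return statistics.stdev(values[:half]), statistics.stdev(values[half:])


rng = random.Random(1)  # fixed seed so the made-up series are reproducible
n = 48                  # assumed length, e.g. 48 monthly observations
times = range(1, n + 1)

# Three hypothetical series, one for each pattern described in the text.
series = {
    # Trend: the level rises by about 0.5 per time step.
    "trend": [50 + 0.5 * t + rng.gauss(0, 1) for t in times],
    # Seasonal variation: a cycle that repeats every 12 time steps.
    "seasonal": [50 + 5 * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 1) for t in times],
    # Pattern in variation: the spread of the observations grows over time.
    "growing variation": [50 + rng.gauss(0, 0.1 * t) for t in times],
}

for name, values in series.items():
    slope = least_squares_slope(values)
    sd_first, sd_second = half_standard_deviations(values)
    print(f"{name:>18}: slope={slope:6.3f}  "
          f"SD first half={sd_first:5.2f}  SD second half={sd_second:5.2f}")
```

These summaries only supplement the plot. A seasonal pattern, for instance, can leave both the slope and the two standard deviations nearly unchanged, which is why the data should still be plotted against time.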
If data is collected over time, a sequence plot can be used to check the assumption of a random sample, which will be needed for inference procedures. As you have learned in your Chapter 5 lecture notes on Sampling, a random sample consists of independent and identically distributed (i.i.d.) observations. This means that the observations can be considered as all coming from the same parent population (with the same or identical distribution) and are independent of one another. With a sequence plot, you can check the identically distributed aspect of a random sample by looking for evidence of stability in the plot. Stability is supported when both the mean of the observations and the amount of variation among observations appear to be constant over time and there does not appear to be any pattern in the resulting plot.

Below are two sequence plots. In the first plot, the observations appear to support that the underlying process that generated them is stable, but that is not the case for the observations in the second plot on the right. In the second plot, there appears to be an increasing trend, so the underlying process does not appear to be stable and the observations should not be considered a random sample.

Q-Q Plots: Later in this class, we will see that the assumption of a normal model for a population of responses will be needed in order to perform certain inference procedures. Previously, we have seen that a histogram can be used to get an idea of the shape of a distribution. However, there are more sensitive tools for checking whether the shape is close to a normal (bell-shaped) model. The best plot that can be used to check for normality is called a Q-Q plot, which is a plot of the percentiles (or quantiles) of a standard normal distribution against the corresponding percentiles of the observed data. If the observations follow an approximately normal distribution, the resulting plot should be roughly a straight line with a positive slope.
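The Q-Q construction just described can be sketched in a few lines of Python. This is an illustration only, not the SPSS procedure: the plotting position (i - 0.5)/n used to pick the percentiles is one common convention and an assumption here (statistical software may use a slightly different formula), and the helper names `qq_points` and `correlation` and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def qq_points(data):
    """Pair each sorted observation with a standard normal quantile.

    The i-th smallest of n observations is matched with the standard normal
    quantile at probability (i - 0.5) / n (an assumed plotting position).
    Returns a list of (normal quantile, observed value) pairs.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


def correlation(pairs):
    """Pearson correlation of the pairs, a rough measure of how straight the plot is."""
    zs = [z for z, _ in pairs]
    xs = [x for _, x in pairs]
    z_mean, x_mean = fmean(zs), fmean(xs)
    szx = sum((z - z_mean) * (x - x_mean) for z, x in pairs)
    szz = sum((z - z_mean) ** 2 for z in zs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return szx / (szz * sxx) ** 0.5


rng = random.Random(7)  # fixed seed so the simulated data are reproducible
normal_data = [rng.gauss(100, 15) for _ in range(200)]   # roughly bell-shaped
skewed_data = [rng.expovariate(1.0) for _ in range(200)]  # skewed right

print("normal data: r =", round(correlation(qq_points(normal_data)), 3))
print("skewed data: r =", round(correlation(qq_points(skewed_data)), 3))
```

Plotting the first coordinate of each pair against the second gives the Q-Q plot. For data from an approximately normal population the pairs fall near a straight line, so r is close to 1.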
Deviations from a straight line indicate possible departures from a normal distribution. Below is an example of a Q-Q plot showing data that does seem to come from a population with an approximately normal distribution.

The three graphs below are examples for which a normal model for the response is not reasonable. The Q-Q plot on the top left indicates the existence of two clusters of observations. The Q-Q plot on the top right shows an example where the shape of the distribution appears to be skewed right. The Q-Q plot on the bottom left shows evidence of an underlying distribution that has shorter tails compared to those of a normal distribution.

Note: What matters is that you can see the departures in the above graphs; it is less important to know whether a departure implies skewed left versus skewed right, and so on. A histogram would allow you to see the shape and type of departure from normality.

Finally, we consider an example Q-Q plot that appears normal with the exception of one data point. In this case, we would say the Q-Q plot shows evidence of an underlying distribution which is approximately normal except for one large outlier that should be further investigated. Note that outliers could appear in either the upper or lower tail.

Activity 1: Time-Dependent Data

Background 1: The data set deathrate.sav contains the death rate (number of deaths per 100 million miles driven) taken at two-year intervals from 1960 to 2004.

Task 1: Display and summarize this data in an appropriate and useful way. What do you see? Would it make sense to make a histogram of the death rates? The following steps will guide your thinking as you complete this task.
1. Once the data set is open, switch from Data View to Variable View. Here, you can set the number of decimal places for each variable, give the variables longer labels, and input coding for categorical variables. What is the extended label description of the variable Rate?
2. Why should a sequence plot be made to display this data?
3. Make a sequence plot for the data using Analyze> Forecasting> Sequence Charts. What does the graph show? Comment on whether you see any trend, seasonal variation, or pattern in variation in this graph.
4. Does the plot appear to be stable? What would you conclude if asked whether the data were a random sample of death rates?
5. Would it make sense to make a histogram of the death rates? Why or why not?

Background 2: The data set oldfaithful.sav contains the date and duration of eruptions (in minutes) of the Old Faithful geyser. The data was collected several times per day over 23 consecutive days.

Task 2: Display and summarize the data in an appropriate and useful way. What do you see? Does there appear to be any pattern to this process? The following steps will guide your thinking as you complete this task.
1. Make a time plot for the data using Analyze> Forecasting> Sequence Charts. What does the graph show? Are there any patterns to the process?
2. Does the plot appear to be stable? What would you conclude if asked whether the data were a random sample of eruptions?

Check Your Understanding: Circle the appropriate word(s) to complete the sentences.
We use sequence (or time) plots to check the independent identically distributed part of the random sample assumption by looking to see if the data appear to be [stable / normal], that is, have a constant mean and constant variation over time. If there is any pattern in the observations over time, we [should / should not] make a histogram of the observations for further analysis.
One important type of time-dependent data is seasonal data (data that shows a pattern that corresponds to seasons of the year). Show below what a sequence plot of seasonal data would look like. Include your labels.

Activity 2: Checking Normality

Background: You have discussed using a histogram to examine the shape of the distribution of a quantitative variable.
If the histogram shows a fairly homogeneous set of observations, we might like to assess whether a normal distribution is a reasonable model for the response. A better graph for assessing normality is a Q-Q plot. In this problem, we will examine a few distributions and see what each corresponding Q-Q plot looks like.

Task 1: Suppose a study examined high school students and the relationship between IQ and GPA. Use the iq.sav dataset and examine the distribution of IQ. Create a histogram and a Q-Q plot for the IQ values. Q-Q plots are created via Analyze> Descriptive Statistics> Q-Q Plots. You may provide a rough sketch of the graphs below.
Histogram:
Q-Q Plot:
1. Describe the shape of the resulting histogram.
2. Is a normal distribution a reasonable model for IQ scores in the population based on this Q-Q plot?

Task 2: We have previously explored the salary variable in the employee data.sav dataset. Now, let's check whether the overall distribution of salary can be considered normal, and then whether the distribution of salary might be normal within each minority status group.
1. Create a histogram for the variable salary, and describe its shape.
2. Create a Q-Q plot for salary using Analyze> Descriptive Statistics> Q-Q Plots. Based on the evidence of these graphs, is a normal distribution an appropriate model for current salary? Why or why not?
3. Create histograms and Q-Q plots separately for minorities and non-minorities (recall the Data> Split File command). Does the distribution of salary appear to be different for either group? Comment on both the histograms and Q-Q plots.
4. For salary, is a normal model reasonable for either minorities or non-minorities?

Check Your Understanding: Match the corresponding histograms, boxplots, and Q-Q plots.
[Figures: Histograms A, B, and C; Boxplots A, B, and C; Q-Q Plots A, B, and C]
Which type of graph best shows the shape of the underlying distribution?
Which type of graph best shows whether the underlying distribution appears to be normal (bell-shaped)?

Example Exam Question on Sequence Plots and Q-Q Plots

A new method of measuring phosphorus levels in soil is under consideration. A sample of 11 soil specimens is analyzed using the new method. The time series (sequence plot) for the 11 observations is presented below.
[Figure: sequence plot of PHSLEV (vertical axis, ticks from 520 to 660) against sequence number (1 to 11)]
a. Comment on the overall stability of these data based on this plot.
b. An assumption of many statistical inference methods is that the data follow a normal distribution. In the space provided below, sketch how the Q-Q plot would appear if a normal distribution were a good model for phosphorus levels.