Author(s): Brenda Gunderson, Ph.D., 2011

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution–Non-commercial–Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.

Some material may be sourced from: Mind on Statistics, Utts/Heckard, 3rd Edition, Duxbury, 2006. Text Only: ISBN 0495667161. Bundled version: ISBN 1111978301. Material from this publication used with permission.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License

Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should do your own independent analysis to determine whether or not your use will be Fair.

Module 2: Sequence Plots and Q-Q Plots

Objective: In this module, you will add to your set of graphical tools for examining data. The graphs you will examine include sequence (time) plots for data collected over time and Q-Q plots for checking whether a normal model is a reasonable distribution for a quantitative variable.

Overview: Module 1 provided a summary of some graphical and numerical tools that can be used to summarize the distribution for quantitative and qualitative variables or responses. We may use those tools for the activities in this module, but we will also need to utilize the new tools described below. Note that these graphical tools are introduced solely in lab, not in lecture, so it will benefit you to read this overview thoroughly.

Sequence (Time) Plots: Data is often gathered over time. Employment rate, stock prices, and sales figures are just a few examples.
When data is gathered over time, it is generally wise to examine the data plotted against time. Plots against time can reveal the main features of a time series: overall patterns and striking deviations from those patterns. Some overall patterns that may arise are:

A persistent, long-term rise or fall, called a trend (either increasing or decreasing).
A pattern that repeats itself at regular intervals of time, called seasonal variation.
A persistent, long-term increase (or decrease) in the variation of the observations, called a pattern in variation.

If data is collected over time, a sequence plot can be used to check the assumption of a random sample, which will be needed for inference procedures. As you have learned, random samples consist of independent and identically distributed (i.i.d.) observations. This means that the observations all come from the same parent population and are independent of one another. With a sequence plot, you can check the identically distributed aspect of a random sample by looking for evidence of stability in the plot. Stability is supported when both the mean of the observations and the amount of variation among observations appear to be constant over time. Below are two example sequence plots; the first appears to be stable while the second does not appear stable.

Q-Q Plots: Later in this class we will see that the assumption of a normal model for a population of responses will be needed in order to perform certain inference procedures. Previously, we have seen that a histogram can be used to get an idea of the shape of a distribution. However, there are more sensitive tools for checking whether the shape is close to a normal (bell-shaped) model. A good plot for checking normality is the Q-Q plot, which plots the percentiles (or quantiles) of a standard normal distribution against the corresponding percentiles of the observed data.
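As a supplement to the SPSS workflow used in this lab, the construction just described can be sketched in Python. This is only a minimal illustration and not how SPSS builds its plot: the function name qq_points, the sample values, and the plotting-position rule (i - 0.5)/n are all assumptions made for the example.

```python
from statistics import NormalDist


def qq_points(data):
    """Pair each sorted observation with a standard normal quantile."""
    observed = sorted(data)
    n = len(observed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: the i-th smallest value is matched with
    # the (i - 0.5) / n quantile. Other conventions exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


# Hypothetical sample, used only to show the shape of the output.
sample = [6.2, 4.1, 7.3, 5.0, 5.6]
points = qq_points(sample)
for normal_quantile, value in points:
    print(f"{normal_quantile:7.3f}  {value:4.1f}")
```

Plotting one coordinate of each pair against the other gives the Q-Q plot for the sample.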
If the observations follow approximately a normal distribution, the resulting plot should be roughly a straight line with a positive slope. Deviations from this would indicate possible departures from a normal distribution. At the right is an example of a Q-Q Plot showing data that does seem to come from a population with an approximately normal distribution.

The three graphs below are examples for which a normal model for the response is not reasonable. The Q-Q plot above left indicates the existence of two clusters of observations. The Q-Q plot above center shows an example where the shape of the distribution appears to be skewed right. The Q-Q plot above right shows evidence of an underlying distribution that has shorter tails compared to those of a normal distribution.

Finally, we consider an example Q-Q plot that appears normal with the exception of one data point. In this case, we would say the Q-Q plot shows evidence of an underlying distribution which is approximately normal except for one large outlier that should be further investigated. Note that outliers could appear in either the upper or lower tail.

Note: It is most important that you can see the departures in the above graphs and not as important to know if the departure implies skewed left versus skewed right and so on. A histogram would allow you to see the shape and type of departure from normality.

Activity 1: Time-dependent Data

Background 1: Below is the death rate (number of deaths per 100 million miles driven) for a number of years, starting with 1960 and going through 2004.

Year  Rate    Year  Rate    Year  Rate
1960  5.1     1978  3.3     1996  1.8
1962  5.1     1980  3.3     1998  1.7
1964  5.4     1982  2.8     2000  1.5
1966  5.5     1984  2.6     2002  1.5
1968  5.2     1986  2.5     2004  1.6
1970  4.7     1988  2.4
1972  4.2     1990  2.5
1974  3.5     1992  2.3
1976  3.3     1994  1.8

Task 1: Display and summarize this data in an appropriate and useful way. What do you see? Would it make sense to make a histogram of the death rates?
The following steps will guide your thinking as you complete this task.

1. Input this data in SPSS. To enter data in SPSS, open a new file. To set up variables, switch from Data View to Variable View. In this view, you can set the number of decimal places for each variable, give them longer labels, and input coding for categorical variables. Return to the Data View when you are ready to enter the data. Inputting data works just like an Excel spreadsheet.
2. Why should a sequence plot be made to display this data in a useful way?
3. Make a sequence plot for the data using Analyze>Forecasting>Sequence Charts. What does the graph show? Comment on whether you see any trend, seasonal variation, or pattern in variation in this graph.
4. Does the plot appear to be stable? What would you conclude if asked whether the data were a random sample of death rates?
5. Would it make sense to make a histogram of the death rates? Why or why not?

Background 2: The data set oldfaithful.sav contains the date and duration of eruptions (in minutes) of the Old Faithful geyser. The data was collected several times per day over 23 consecutive days.

Task 2: Display and summarize the data in an appropriate and useful way. What do you see? Does there appear to be any pattern to this process? The following steps will guide your thinking as you complete this task.

1. Make a time plot for the data using Analyze>Time Series>Sequence Charts. What does the graph show? Are there any patterns to the process?
2. Does the plot appear to be stable? What would you conclude if asked whether the data were a random sample?

Check Your Understanding: Circle the appropriate word(s) to complete the sentences.

We use sequence (or time) plots to check the independent identically distributed part of the random sample assumption by looking to see if the data appear to be (stable / normal), that is, have a constant mean and constant variation over time.
If there is any pattern in the observations over time, we (should / should not) make a histogram of the observations for further analysis.

One important type of time-dependent data is seasonal data (data that shows a pattern that corresponds to seasons of the year). Show below what a sequence plot of seasonal data would look like. Include your labels.

Activity 2: Checking Normality

Background 1: You have discussed using a histogram to examine the shape of the distribution of a quantitative variable. If the histogram shows a fairly homogeneous set of observations, we might like to assess whether a particular distribution, called the normal distribution, is a reasonable model for the response. A better graph for assessing normality is a Q-Q plot. In this problem we will examine a few distributions and see what each corresponding Q-Q plot looks like.

Task 1: Suppose a study examined high school students and the relationship between IQ and GPA. Use the iq.sav dataset and examine the distribution of IQ. Create a histogram and a Q-Q plot for the IQ values. Q-Q plots are created via Analyze>Descriptive Statistics>Q-Q Plots. You may sketch the graphs below.

1. Describe the shape of the resulting histogram.
2. When a bell-shaped, normal distribution is reasonable, the points on a Q-Q plot will approximately follow a straight line with positive slope. This line shows where points should fall if the sample quantiles are equal to the theoretical quantiles. Is a normal distribution a reasonable model for IQ scores in the population based on this Q-Q plot?

Task 2: We have previously explored the SALARY variable in the Employee data.sav dataset. Now, let's check whether the overall distribution of SALARY can be considered normal, and then whether the distribution of SALARY might be normal within each sex.

1. Use the Employee data.sav dataset and create the histogram for the variable SALARY (current salary).
2. Describe the shape of the resulting histogram.
3. Create a Q-Q plot for SALARY using Analyze>Descriptive Statistics>Q-Q Plots. Based on the evidence of these graphs, is a normal distribution an appropriate model for current salary? Why or why not?
4. Create histograms and Q-Q plots separately for each sex (recall the Data>Split File command). Does the distribution of SALARY appear to differ between the sexes? Comment on both the histograms and Q-Q plots.
5. For salary, is a normal model reasonable for either sex?

Check Your Understanding: Assume that a normal model is NOT reasonable for the distribution of the waiting time at a bank. Using the best plot available for assessing normality, draw an example plot that would indicate this. Hint: For Q-Q plots, due to how they are constructed, points must be increasing along the y-axis as you move right across the x-axis.

Notes about Saving SPSS data sets

(1) When you end your SPSS session, if you have not saved a data set that you entered or changed, a dialog box will ask you "Do you want to save changes?". Click on OK if you want to save the data set at this point. A dialog box will appear asking you to name the data file, then click on Save. Click on Don't Save if you don't want to save the data. Also note that you cannot save anything to a "locked" disk.
(2) When you save a data file using the Save as... command, the dialog box that appears allows you to designate where the file is to be saved. Click on the box at the top with the arrow in it until it shows the name of the place where you want to save your file (usually on your own disk).

Example Exam Question on Sequence Plots and Q-Q Plots

A new method of measuring phosphorus levels in soil is under consideration. A sample of 11 soil specimens is analyzed using the new method. The time series (sequence plot) for the 11 observations is presented below.

[Sequence plot: PHSLEV on the vertical axis (scale 520 to 660) against sequence number 1 to 11 on the horizontal axis.]

a. Comment on the overall stability of these data based on this plot.
b. An assumption of many statistical inference methods is that the data follow a normal distribution. In the space provided below, sketch how the Q-Q plot (quantile plot) would appear if a normal distribution was a good model for phosphorus levels.
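Supplementary sketch: the stability idea from the overview (roughly constant mean and constant variation over time) is judged by eye from a sequence plot in this lab. As a rough numerical companion, one could compare the mean and standard deviation of the first and second halves of a series. The Python sketch below does this for the death rates listed in Activity 1. The helper name split_half_summary and the choice to split the series in half are assumptions made for illustration, and this comparison is not a substitute for looking at the plot.

```python
from statistics import mean, stdev

# Death rates (deaths per 100 million miles driven), every second year
# from 1960 through 2004, copied from the Activity 1 table.
rates = [5.1, 5.1, 5.4, 5.5, 5.2, 4.7, 4.2, 3.5, 3.3,
         3.3, 3.3, 2.8, 2.6, 2.5, 2.4, 2.5, 2.3, 1.8,
         1.8, 1.7, 1.5, 1.5, 1.6]


def split_half_summary(values):
    """Return (mean, sd) for the first half and for the second half."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (mean(first), stdev(first)), (mean(second), stdev(second))


(first_mean, first_sd), (second_mean, second_sd) = split_half_summary(rates)
print(f"first half:  mean = {first_mean:.2f}, sd = {first_sd:.2f}")
print(f"second half: mean = {second_mean:.2f}, sd = {second_sd:.2f}")
```

Similar values in the two halves are consistent with stability, while clearly different means or standard deviations suggest a trend or a pattern in variation worth checking on the sequence plot.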