EDL 726: Introduction To Statistical Concepts and Visual Display
Section 1: Introduction to SPSS and the Generation of Visual Strategies to Convey Statistical Data

For the next three weekends, you will receive several step-by-step tutorials that guide you through the concepts behind specific types of analyses, so that you understand how to produce, analyze, and display statistics in a meaningful and informative way. The section for this week contains 2 parts. The first part is a basic introductory tour of SPSS: a description of the start-up procedure, descriptions and explanations of the windows you will encounter while using SPSS, and data entry. Once this preliminary lesson is completed, I will guide students through the procedures involved in generating the most common visual representations of data.

Part I: To Begin SPSS
• Step #1: Activate the program by double-clicking on the SPSS icon. If you have a previous version of SPSS (PASW), the screens will be very similar to the ones used in this tutorial.
• Once SPSS has started, it may go directly to what looks like a spreadsheet, or you may encounter a beginning screen that asks you to check an option concerning your intentions. If this beginning screen appears, we will be choosing the “Type in data” option. This will open a window that looks very similar to an Excel spreadsheet.
• Step #2: Click on “Type in data” and then click “OK”.
• A blank spreadsheet screen will appear (see figure).
• SPSS uses two views, designated Data View and Variable View (see blue arrows in figure). You can switch between these two views by clicking on the tabs in the lower left (PC) or middle (Mac).
Variable View: This is where the variables are entered and defined
Data View: This is where the raw data are entered
• Step #3: Click on the “Variable View” tab.
• This will take you to the screen that allows you to enter descriptive information about the data to be entered.
• Near the top of the page you will see a number of column headings. At this time we are only going to concern ourselves with the columns labeled “Name”, “Label”, “Values”, and “Measure”.
Name: The name you assign to a variable
Label: A description of the variable (optional)
Values: This will be discussed later
Measure: This indicates the type of measurement scale the data are from
• All other column headings should be left at their defaults (the exception is that I typically have the “Decimals” column set to 2 places, which may or may not be the default).
• Step #4: Data Entry. Believe it or not, that concludes our SPSS start-up tutorial; we will next focus on entering our first data set. Go back to the spreadsheet view by clicking the “Data View” tab and click on the first empty cell in the upper left.
• Start entering data from the top down as shown in the figure below. I will be using a sample data set of student test scores from a mathematics placement exam (see figure). Notice that when you start typing in the data, the column heading changes to “VAR0001”. We will rename this variable in the next step.
• Step #5: Once the data are entered as shown in Step 4, click on the “Variable View” tab. This will allow us to define the properties of our variable.
• Once you enter the Variable View window, you will notice that the top row has been filled in automatically and the default variable name VAR0001 has been entered. We are going to change that to the more descriptive Math_Placement.
• Click in the cell containing the name VAR0001 and change it to Math_Placement. Be careful, because sometimes this step is a little trickier than it should be: SPSS does not allow spaces in variable names, and a name cannot begin with a number.
• All other values should look like the figure below.
Typically, the only other thing that you will need to change is the “Measure” setting, which should match the type of data recorded. In this case it is scale data, because a math placement score is continuous data. Click in the box to change the measurement scale.
• For this demonstration we do not need to worry about the “Label” column or the “Values” column: Math_Placement is descriptive enough, and since we only have one variable (and only one measurement per student) the “Values” column does not apply here.
• Step #6: Click on the “Data View” tab and you will return to the spreadsheet view. You are now ready to generate some informative graphs/charts that will aid in the description of your data. Notice that the column heading in the spreadsheet view reflects the name you assigned in the previous step.

Part II: Generating Various Graphical Output from a Data Set

After visually scanning our data set, there are certain things that we can conclude. For example, we can easily see the highest and lowest math placement scores, and we can also see that most values are in the 120s. But if this data set were much larger, as it typically would be, it would be hard to draw any meaningful inferences based on visual inspection alone. Because of this, we need to summarize the data and explain the data set in a way that can be easily interpreted by others. There are essentially two ways to do this: through the use of summary statistics and by visual representation. Both of these strategies attempt to break the data down in a way that is easy to understand. One of the first steps in any analysis is to generate the basic statistics, including the measures of central tendency (mean, median, and mode) as well as the standard deviation (more about the meaning of this later). Other statistics that are of use include the range, interquartile range, frequencies, etc.
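As a rough cross-check on the kind of numbers SPSS reports, the statistics named above can be sketched in a few lines of Python using only the standard library. The scores below are hypothetical values invented for illustration (20 values between 118 and 125); they are not the data shown in the figures, and the function names `summarize` and `frequency_table` are likewise assumptions of this sketch.

```python
import math
import statistics
from collections import Counter

# Hypothetical placement scores, invented for this sketch (NOT the course data).
HYPOTHETICAL_SCORES = [
    118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
    122, 122, 122, 123, 123, 123, 124, 124, 125, 125,
]


def summarize(scores):
    """Return basic descriptive statistics for a list of numeric scores."""
    variance = statistics.variance(scores)  # sample variance (divides by n - 1)
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),      # most frequent score
        "variance": variance,
        "std_dev": math.sqrt(variance),       # square root of the variance
        "range": max(scores) - min(scores),   # largest minus smallest
    }


def frequency_table(scores):
    """Return rows of (score, frequency, percent, cumulative frequency)."""
    counts = Counter(scores)
    n = len(scores)
    rows, running_total = [], 0
    for score in sorted(counts):
        running_total += counts[score]
        percent = 100 * counts[score] / n     # f / n, expressed as a percent
        rows.append((score, counts[score], percent, running_total))
    return rows


if __name__ == "__main__":
    for name, value in summarize(HYPOTHETICAL_SCORES).items():
        print(f"{name}: {value}")
    for row in frequency_table(HYPOTHETICAL_SCORES):
        print(row)
```

With these made-up scores, the score 118 occurs once, so its row in the frequency table shows 1/20, or 5%. Your own SPSS output will depend on the data you actually entered.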
We will not be looking at all of the output that SPSS can generate; instead we will focus on a few important numbers that will appear frequently throughout the semester. Let’s start by opening SPSS and entering the same data that we used in our earlier example: the math placement scores for 20 students. If you still have that data set open, that’s great. If not, follow the directions already given until your data are entered and the variable has been named Math_Placement, and then follow these steps:
• Step #1: Assuming that you have finished entering the data for the 20 students, choose “Analyze” from the main menu, then “Descriptive Statistics”, and then “Frequencies”.
• You will now be presented with a dialogue box similar to the one shown on the left.
• Click on the Math_Placement variable in the window on the left and click on the little blue arrow. This will move Math_Placement over to the “Variable(s)” window, which tells SPSS which variables you are interested in summarizing. The following figure shows what this dialogue box should look like.
• Step #2: It is now time to instruct SPSS to compute the specific statistics that we are interested in. We will do this by clicking on the “Statistics” button. This will bring up another dialogue box with a number of descriptive statistical options.
• For this exercise, make sure to check the boxes indicated in the figure below. After the appropriate descriptive statistics are checked, click the “Continue” button and you will return to the previous dialogue box.
• Step #3: Once you have returned to the initial dialogue box, click on the “Charts” button. Again, you will be presented with another dialogue box.
• For this exercise (and for statistical reasons), make sure that the chart type you want to display is a histogram. Histograms are for continuous data; bar charts and pie charts are for discrete data (nominal data).
Since a math placement score is a continuous variable, a histogram is the appropriate choice.
• Once you have chosen the appropriate chart, click the “Continue” button and you will return to the original dialogue box. Simply click on the “OK” button and SPSS will perform the requested operations.
• Step #4: SPSS includes two separate windows. One window contains the Data View as well as the Variable View. The other window becomes important once you have SPSS perform an analysis. This window is called the output window and, of course, contains all of your output (or analyses). Access your output window (called “output 1”) through the “Window” drop-down menu.
• You should have three pieces of output: 2 tables and a histogram. The output should look like this:

Congratulations, you have successfully entered data into SPSS and instructed SPSS to run specific descriptive statistics as well as an appropriate visual representation of the data. Almost all operations in SPSS follow these same concepts: lots of dialogue boxes and lots of choices. However, there is another very important task to perform, that of interpretation. We will now look at the individual tables as well as the individual values within each table and describe what they mean. Once a person understands the steps involved in running an analysis in SPSS it becomes a simple operation, but many people, even professionals, have trouble understanding the importance and relevance of output in its totality. I will wait to explain some of the terms until we have a bit more background in statistics, but we can begin with some of the more basic terms. These descriptive statistics must be understood in order to interpret some of the statistical analyses we will be conducting later.

Our first table of data contains most of the descriptive statistics that we are interested in.
In SPSS there are a number of ways to get the same data, and I have shown you only this one (another way is to go to Analyze, then Descriptive Statistics, and follow the options there). I encourage all students to “play around” with other methods of producing data. Let’s take a look at the first table in some detail:
N: This indicates the number of math placement scores recorded.
Mean: The mean is the average of all the measurements. Even though we calculated all three measures of central tendency, the mean is the appropriate measure for continuous data.
Standard deviation and variance: These are measures of dispersion and will be discussed later. If we take the square root of the variance, we get the standard deviation.
Range: The range is simply the distance from the smallest measurement to the largest.

The second table of data that was produced is essentially a frequency distribution table. This output is especially valuable when summarizing data in terms of individual scores:
Frequency: The number of times each individual score occurred
Percent: The frequency divided by the total number of scores (f/n). For the score with a value of 118, the calculation would be 1/20 = .05, or 5%
Cumulative frequency: The running total of the frequencies, starting from the smallest score

The last couple of outputs we will talk about are the box plot and the stem-and-leaf plot. The box plot is really an extension of the percentiles that were listed in the descriptives table (I left out the description of those because they are more useful when looking at a box plot). To produce a box plot as well as a stem-and-leaf plot, follow the steps below:
• Step #1: Choose “Analyze” from the main drop-down menu and choose “Explore” from the fly-out menu.
• You will again be presented with the dialogue box on the left (it looks a lot like some of the other dialogue boxes). Make sure that the “Both” button is selected under the “Display” heading.
Then follow the instructions as before: select the variable that you are interested in and move it over to the “Dependent List” window using the associated arrow.
• Step #2: From the dialogue box on the right click “Plots”. This will bring up a dialogue box that will allow you to choose which plots are appropriate for your data.
• At this point we will not be concerned with the “Statistics” button in the dialogue box on the right; we will just accept the defaults (in actuality, the default setting will produce many of the same descriptive statistics that we already have).
• Make sure that the “Factor levels together” radio button and the “Stem-and-leaf” checkbox are both selected. The “Normality plots with tests” checkbox may be selected by default. That is OK, but we will address that output at a later time.
• Click on “Continue” and when you get to the previous dialogue box, click on “OK”. SPSS will now produce the chosen output.

You should now access the output window as we have done before. SPSS appends all analyses at the end of the file, so you may need to scroll down quite a bit to find your requested output. Don’t worry if there seems to be a lot more output than you expected. At this point we will only be reviewing the box-plot and the stem-and-leaf output. Scan through your SPSS output and you should come across the following plot. This is called a box-plot and really consists of 5 parts.

This line, often called the upper whisker, represents the largest value in our data set (Math_Placement largest value = 125).
This line represents the 75th percentile of scores.
This value should match the 75th percentile score shown in our descriptive statistics table.
This heavy line represents the median, or the 50th percentile score, and should match both the median and the 50th percentile score shown in the descriptive table.
This line represents the 25th percentile score and should match the 25th percentile score shown in the descriptive table.
This line, often called the lower whisker, represents the lowest value in our data set (Math_Placement lowest value = 118).

There is a lot of information that can be understood by looking at a box-plot. For example, we know what the range of our data is simply by looking at the high and low whiskers. We can see how the middle 50% of the scores are clustered by the size of the box. We can also get an idea of the shape of the distribution by examining the location of the middle 50% of the scores relative to the whiskers. For example, if the box is directly in the middle of the whiskers and the heavy line (median) is in the center of the box, the distribution would be perfectly symmetrical. If the entire box were pushed toward one of the whiskers, then we would have a skewed (asymmetrical) distribution. To some extent we can see this distribution by examining our last output, the stem-and-leaf plot. The stem-and-leaf plot simply converts our frequency distribution table into stems (the values obtained for Math_Placement) and leaves (the frequencies of each value, indicated by “0”s). The result is a kind of rudimentary depiction of our distribution. Note how this relates to the box-plot and to the histogram produced earlier.

There is an additional concept to the box-plot that should be addressed here: the concept of extreme scores, or outliers. In our example of Math_Placement scores there were no outliers, and therefore our box-plot was very simple, indicating the range of scores with the upper and lower whiskers. If the data set did include some extreme scores, the output would be a bit different.
The whiskers would then extend only as far as the most extreme scores that lie within 1.5 box lengths (interquartile ranges) of the edges of the box, and SPSS would treat these as “normal” values. The other values, the scores lying outside the whiskers, would be plotted as individual points. In order to graphically represent this in a simple way, I will add 6 scores to our data set that would be considered extreme scores. The figure below represents our new data set, including the original 20 scores plus the 6 extreme scores. Notice that my added scores are now outside the whiskers and are shown as points. This is done so the researcher can flag them, check whether they are truly outliers, and decide whether they should be eliminated from the analysis. Notice (if you can see it) that the range of the box-plot is unchanged, as shown by the values of the whiskers. However, the location of the box as well as the heavy line indicating the median (50th percentile score) have changed, because the values that I added are still taken into account when determining the percentile rank of all scores.
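The outlier idea can also be sketched in Python. The sketch below assumes the common 1.5 × IQR convention (Tukey's rule) for deciding which points fall beyond the whiskers. The scores, the six extreme values, and the function name `box_plot_parts` are all hypothetical, and the quartiles computed by `statistics.quantiles` may differ slightly from the percentiles SPSS reports, since several percentile definitions are in use.

```python
import statistics

# Hypothetical data, invented for this sketch (NOT the course data set).
ORIGINAL_SCORES = [
    118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
    122, 122, 122, 123, 123, 123, 124, 124, 125, 125,
]
EXTREME_SCORES = [108, 109, 134, 135, 136, 137]  # six made-up extreme values


def box_plot_parts(scores, k=1.5):
    """Return quartiles, whisker ends, and flagged points for a box plot.

    Assumes points more than k * IQR beyond the box edges are plotted
    separately (k = 1.5 is the usual convention).
    """
    q1, median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1                      # height of the box
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    outliers = sorted(s for s in scores if s < lower_fence or s > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": min(inside),  # most extreme value still inside the fence
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


if __name__ == "__main__":
    print(box_plot_parts(ORIGINAL_SCORES))
    print(box_plot_parts(ORIGINAL_SCORES + EXTREME_SCORES))
```

With these made-up values, the whisker ends stay at 118 and 125 after the six extreme scores are added, while the median moves from 121.5 to 122. This is the same pattern described above: the whiskers are unchanged, but the box and median shift.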