Tutorial SPSS 1: Descriptive Statistics

Introduction
The purpose of this tutorial is to familiarize students with the techniques needed for statistical analysis. The program used is SPSS (for Windows). The "Hoorn Study", introduced below, will be used frequently as an example of statistical analysis.

The Hoorn Study
The Hoorn Study is used as an introductory example in the computer lab. The data file is called HOORN.SAV. The study is an epidemiological screening for the prevention of diabetes (diabetes mellitus) in the general population. It was conducted among 2500 residents of the municipality of Hoorn, aged 50 to 75 years. The representative sample used here is limited to 250 participants.
HOORN.SAV contains a selection of 25 variables from the General Questionnaire, which asks about personal background (gender, education, etc.), medical history, symptoms related to diabetes, life events and lifestyle. In addition, the file contains a selection of 13 variables with laboratory measurements on the same study population. Of course, all identifying information (name, address) has been removed from the data.
For most variables, "user missing values" are coded. These are codes (usually impossibly high values such as 9, 99 or 999) that indicate missing values. SPSS must, of course, ignore these "missing values" in its calculations.

Notation
Items from the main menu and submenus, procedure names, buttons leading to sub-dialogues and other controls such as Continue and OK are shown in Bold with an initial capital and separated by a ";". Names of windows, boxes, radio buttons and check boxes are shown in italics. This also applies to text that appears in the computer output. File names and variable names are shown in UPPERCASE.

Startup SPSS and Location of Files
Turn on the computer and open Windows.
Double-click the SPSS icon on the desktop (if present), or click the Start button at the bottom left of the task bar and select SPSS for Windows via Programs. The title bar, menu bar, toolbar, the Data Editor (at this point an empty spreadsheet) and the status bar should appear. At the bottom left of the screen are two tabs: Data View and Variable View. If Data View is active, a spreadsheet is available for entering the data of the subjects. If Variable View is active, variables can be defined (whether the variable is continuous or not, the definition of missing values, etc.). The content and appearance of the toolbars and menus change depending on which window is active. If a particular item is not available, check whether the correct window is active.
The data needed for the exercises in this computer lab have already been entered and stored in SPSS system files. System files are SPSS-specific files that contain not only the data but also their definition (labels, "missing value" definitions, etc.). The information in a system file is stored in a format that only SPSS can work with. All SPSS system files have the extension 'SAV', which makes them easy to recognize.

Syntax
When you carry out a task (e.g. calculating the mean and median of a variable), you can have SPSS execute it directly from the menus, or you can write the command to a syntax file. A syntax file contains the "computer language" with which the command is executed. If you write a command to a syntax file and save that file, SPSS can run it again later without you having to specify the command again. An additional advantage is that the specification of complex analyses does not need to be written down or remembered exactly: the syntax documents the analysis. It is therefore advisable always to write the command to a syntax file first and then run it from there. This is a simple procedure that will be described on page 4.
A syntax file can be recognized by the file extension 'SPS'.

SPSS Viewer
The results of the procedures you run appear in the SPSS Viewer. Because this output quickly becomes large and hard to navigate, it is recommended either to close this window regularly without saving, or to discard parts of the output regularly by selecting the relevant sections in the left pane of the Viewer and deleting them. With the latter approach you keep the useful examples of SPSS output, which you can then save in a file with a name of your choosing. Viewer documents have the extension 'SPO'. You can then copy that file from the hard disk to your own disk using Windows Explorer. The same applies, of course, to the SPSS system files (SAV) and syntax files (SPS) used in this lab.

PRACTICAL LAB 1: DESCRIPTIVE STATISTICS AND DATA SCREENING
In this exercise, the functions for describing and summarizing nominal, ordinal and continuous data are discussed. The descriptive statistics commands are found in the submenu Analyze; Descriptive Statistics. The procedures Frequencies and Descriptives are addressed first. Next, attention is given to the procedure Means from Analyze; Compare Means. Lastly, the procedure Explore from the submenu Analyze; Descriptive Statistics is covered.

FREQUENCY FUNCTION
The Frequencies function is particularly suitable for describing and graphically displaying nominal and ordinal variables (frequency tables and bar charts). The function can, however, also be used for continuous variables. Measures of location (mean, median, mode) and of dispersion (variance, standard deviation, minimum and maximum, range and percentiles) can be calculated using the Statistics button of the Frequencies dialogue box. Bar charts, pie charts and histograms can be made with the Charts button.
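To make the measures of location and dispersion listed above concrete, here is a minimal Python sketch that computes them for a small set of values. It is not SPSS syntax and not part of the lab; the glucose values are hypothetical and not taken from HOORN.SAV, and the function name describe is an illustrative assumption. Percentile definitions differ between programs, so quartiles computed this way may deviate slightly from SPSS output.

```python
import statistics

# Hypothetical fasting glucose values (mmol/l); NOT taken from HOORN.SAV.
glucose = [4.8, 5.1, 5.1, 5.4, 5.6, 5.9, 6.3, 7.2]


def describe(values):
    """Return common measures of location and dispersion for a list of numbers."""
    minimum, maximum = min(values), max(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (divisor n - 1)
        "sd": statistics.stdev(values),           # sample standard deviation
        "minimum": minimum,
        "maximum": maximum,
        "range": maximum - minimum,
        # Three cut points: 25th, 50th and 75th percentile.
        "quartiles": statistics.quantiles(values, n=4),
    }


summary = describe(glucose)
for name, value in summary.items():
    print(name, value)
```

Mean, median and the dispersion measures are only meaningful for continuous variables; for a nominal variable the mode is the only measure in this list that can be interpreted.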
For the following exercise, we use three variables with different measurement levels from the file 'HOORN.SAV':
BURGST : marital status (nominal)
OPLEID : education (ordinal)
SGLUCN : fasting glucose value (continuous)
First start SPSS (see Introduction) and open the file 'HOORN.SAV' through File; Open. The files required for the lab are available on the intranet. Choose Analyze; Descriptive Statistics; Frequencies. After choosing Frequencies a dialogue box appears. In this window you specify the variables (in our case BURGST, OPLEID and SGLUCN) for which a frequency distribution is to be made.
(Note: depending on the SPSS settings, either the names or the labels of the variables appear in the dialogue box. Labels, especially long ones, can be annoying. Change the settings via Edit; Options: on the General tab, under Variable Lists, select Display names and Alphabetical. Then retrieve the file again, in a somewhat roundabout way, via File; New; Data (which opens a blank screen) and File; Open; 'HOORN.SAV' (we are back in Hoorn), and open the Frequencies dialogue again.)
Select the relevant variables by double-clicking them, or by clicking them once and using the arrow button to move them into the Variable(s) box. If necessary, tick Display frequency tables (otherwise you will not get the desired frequency tables). Then click OK. The window Output1 of the SPSS Viewer appears. The Statistics table of Frequencies shows, for each variable, the number of valid values (Valid) and the number of missing values (Missing). Is marital status apparently unknown for 1 of the 250 participants, and are data on education missing for 3 participants?
Each variable is then printed in a table that shows, for each value, the number of cases (Frequency), the percentage of cases calculated over the total number of cases including the missing values (Percent), the percentage of cases calculated over the cases with valid values only, disregarding the missing values (Valid Percent), and the cumulative percentage of cases with valid values (Cumulative Percent). If all is well, the description of each variable and the descriptions of the individual values ("value labels") appear as well. If this is not the case, set this option as follows: from the Edit menu choose Options and then the Output Labels tab. Under Variable values in labels shown as, choose Values and Labels, and click OK. From now on, the Frequencies output shows both the values and their labels. The tables for BURGST and OPLEID are now easy to read.
A lot of output is produced for the continuous variable SGLUCN. A frequency table is not suitable for a continuous variable! If we want to summarize such a variable in a table, it is better to divide it into classes first and then make a frequency distribution of the classes.
We will now calculate some statistical measures. Open the Frequencies dialogue again using Analyze; Descriptive Statistics; Frequencies. Click Statistics. You are now in the dialogue Frequencies: Statistics. Most options are self-explanatory. With the option Percentile(s), any desired percentile can be calculated. If, for example, you want to calculate the tenth and ninetieth percentiles, tick the option Percentile(s). Then type "10" and click Add (10 now appears in the box), then type "90" and click Add (90 now appears in the box as well). Quartiles are, incidentally, available as a standard option in the dialogue box (Quartiles). After selecting the other desired statistical measures, click Continue and then OK. Interpret the results. Note the measurement level of the variables!
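The difference between the Percent, Valid Percent and Cumulative Percent columns described above can be illustrated with a small Python sketch. This is not SPSS syntax; the coded answers, the user-missing code 9 and the function name frequency_table are assumptions made for illustration only.

```python
from collections import Counter

# Assumption: the code 9 marks a user-missing value for this variable.
USER_MISSING = {9}

# Hypothetical coded answers (1-3) for an ordinal variable; NOT from HOORN.SAV.
education = [1, 2, 2, 3, 1, 2, 9, 3, 2, 1]


def frequency_table(values, missing_codes):
    """Build rows of (value, frequency, percent, valid percent, cumulative percent)."""
    total = len(values)
    valid = [v for v in values if v not in missing_codes]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        n = counts[value]
        percent = 100 * n / total             # base: all cases, missing included
        valid_percent = 100 * n / len(valid)  # base: valid cases only
        cumulative += valid_percent
        rows.append((value, n, percent, valid_percent, cumulative))
    return rows, total - len(valid)


rows, n_missing = frequency_table(education, USER_MISSING)
for value, n, percent, valid_percent, cumulative in rows:
    print(f"{value}  {n:3d}  {percent:6.1f}  {valid_percent:6.1f}  {cumulative:6.1f}")
print("missing:", n_missing)
```

Because the missing case is left out of the denominator, each Valid Percent is slightly higher than the corresponding Percent, and the cumulative column ends at 100.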
How useful are the statistical measures from the Frequencies: Statistics dialogue for nominal and ordinal variables? We can also display the frequency distributions of these variables graphically. For nominal and ordinal variables we create a bar chart or a pie chart, and for continuous variables a histogram. For the latter, SPSS chooses the class intervals itself. We build these commands via Analyze; Descriptive Statistics; Frequencies. Consider for which variables a bar chart, pie chart or histogram is appropriate. In the Frequencies dialogue click Charts, and in the dialogue box that follows select Bar charts (or Pie charts) or Histograms. In the latter case, the option With normal curve superimposes a normal curve on the histogram. This normal curve is constructed from the mean and standard deviation of the data. For continuous variables, also suppress the frequency table by removing the tick from Display frequency tables in the Frequencies dialogue box. (Note: you therefore have to run Frequencies several times.)

DESCRIPTIVE FUNCTION
The Descriptives function is primarily intended for continuous variables. With this function the most important statistical measures can be calculated, but the options are not as extensive as in the Frequencies function. The output is more compact: no frequency tables are produced, and the results for multiple variables are arranged one below the other in a single table. By default, the mean, standard deviation, minimum and maximum of the data are calculated. Click the Options button to select additional statistical measures. By selecting Save standardized values as variables it is also possible to calculate Z-scores for the specified variables (the difference between each observation and the mean of all observations, divided by the standard deviation of the observations). These Z-scores are added to the existing variables as new variables. SPSS chooses the names for these new variables.
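The Z-score definition given above (observation minus mean, divided by the standard deviation) can be written out in a few lines of Python. This is a minimal sketch, not SPSS syntax; the blood pressure values are hypothetical, the function name z_scores is an illustrative assumption, and the sample standard deviation (divisor n - 1) is assumed.

```python
import statistics


def z_scores(values):
    """Standardize values: (observation - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


# Hypothetical systolic blood pressures (mmHg); NOT taken from HOORN.SAV.
systolic = [118, 125, 131, 140, 152, 164]
z = z_scores(systolic)
print([round(score, 2) for score in z])
```

By construction the standardized values have mean 0 and standard deviation 1, which makes variables measured in different units comparable.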
For the variables LENGTH, WEIGHT and SBLDSYS1 (systolic blood pressure at the first measurement) from the file 'HOORN.SAV', determine the mean, the standard deviation, the variance, the standard error of the mean (SE), and the minimum and maximum values. You will notice that a mistake was made when the data for SBLDSYS1 were entered. What happened? Correct the error and repeat the procedure.

MEANS FUNCTION
The Means function (only for continuous variables) calculates a number of descriptive statistical measures for a variable, but separately for one or more subgroups, such as men and women. Gender is then the variable that defines the grouping (the independent variable). We can calculate some statistical measures of fasting blood glucose (SGLUCN) for the different treatment categories of diabetes (DMBEHAND). Open the Means dialogue via Analyze; Compare Means; Means. Put SGLUCN in the Dependent List box and DMBEHAND in the Independent List box. By default, for every variable in the Dependent List (now only SGLUCN) the mean, the number of valid cases and the standard deviation of the observations are calculated for each value of the variables in the Independent List (here only DMBEHAND). By clicking the Options button, many other statistical measures can be selected. Select, for example, the median and the minimum and maximum. After you click Continue and OK, a clearly arranged output appears in the SPSS Viewer.
Finally, calculate the summary statistical measures for the variable SBLDSYS1 (systolic blood pressure, first measurement at screening) by gender. Do the same for a dichotomy of the variable weight (< 80 kg and >= 80 kg). To do so, first recode the variable WEIGHT (using Transform; Recode; Into Different Variables).
An alternative approach to calculations on subgroups is the Select Cases dialogue, via Data; Select Cases. Select each subgroup using If condition is satisfied.
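The subgroup calculations of the Means function and the dichotomy of weight described above can be sketched in Python as follows. This is an illustration only, not SPSS syntax; the records, the codes 1 and 2 for the weight classes, and the helper names weight_class and summarize_by are assumptions.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (sex code, weight in kg, systolic blood pressure in mmHg).
# The coding 1 = men, 2 = women is assumed here; the values are NOT from HOORN.SAV.
records = [
    (1, 84, 142), (2, 67, 128), (1, 78, 135),
    (2, 81, 150), (2, 59, 121), (1, 92, 158),
]


def weight_class(weight_kg):
    """Dichotomize weight: 1 for < 80 kg, 2 for >= 80 kg (codes are arbitrary)."""
    return 1 if weight_kg < 80 else 2


def summarize_by(rows, group_of, value_of):
    """Number of cases, mean and standard deviation of a value per subgroup."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(value_of(row))
    return {
        group: {
            "n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
        }
        for group, values in sorted(groups.items())
    }


by_sex = summarize_by(records, lambda r: r[0], lambda r: r[2])
by_weight = summarize_by(records, lambda r: weight_class(r[1]), lambda r: r[2])
print(by_sex)
print(by_weight)
```

Recoding into a new variable, as the lab asks, keeps the original weights intact; the sketch does the same by computing the class on the fly instead of overwriting the weight.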
With Select Cases you can, for example, first select the men (SEX = 1) and run the Frequencies function with all required (statistical and graphical) options. Do not forget to suppress the frequency table. Repeat this procedure after selecting the women (SEX = 2). The advantage of these additional operations is that the graphical capabilities (histogram) of the Frequencies procedure can now be used. Another alternative: Data; Split File for SEX (followed by Frequencies).

EXPLORE FUNCTION
To determine whether the (continuous) data from your sample are normally distributed, you can make use of the already familiar functions Frequencies and Descriptives. The Explore function, however, combines all the necessary options. It is intended for continuous variables and calculates the required descriptive measures of location and dispersion. Explore also produces various plots: histograms, box-and-whisker plots, stem-and-leaf plots and normal probability plots. Normality can also be tested.
Reopen the file 'HOORN.SAV'. We now continue with the variable LENGTH. Open the Explore dialogue using Analyze; Descriptive Statistics; Explore. Place LENGTH in the Dependent List box and run the command (OK). Two tables and two graphs appear in the SPSS Viewer. The first table again contains information on the number of valid cases and the number of cases with missing values. The second table contains statistical measures of location and spread, including the median and the interquartile range (IQR). The mean and median differ little, and the standard deviation of the observations is small. In this case (LENGTH takes only positive values), these results look good with regard to the normality of LENGTH. The first graph is a stem-and-leaf plot, which looks normal. The last graph, part of the standard output, is a box-and-whisker plot. This also looks good (median in the middle of the box, symmetric whiskers).
Moreover, in the box-and-whisker plot Explore shows the case numbers of cases that are possible outliers and/or extreme values. If there were such cases in our example, the questionnaire would be reviewed to find out where the extreme values come from.
Repeat the exercise, but now click the Plots button in the Explore dialogue box. Tick the items Histogram and Normality plots with tests, and end the command sequence with Continue and OK. After the two tables already described, a third table appears with the results of the Kolmogorov-Smirnov test for normality. If the number of cases is 50 or fewer (which is not the case here), the Shapiro-Wilk test also appears in the output. There are quite a few objections to these tests (certainly given our 250 cases), so ignore these results. For those who are uncomfortable with this: the p-value of the test is 0.200, so we do not reject the null hypothesis "LENGTH is normally distributed".
More importantly, we judge the normality of the data by visual inspection. For this we have four charts available (we ignore the fifth). The histogram looks good, as does the stem-and-leaf plot, of course. The next chart is a normal probability plot (Normal Q-Q Plot of LENGTH). This plot is an excellent tool for deciding whether or not the variable is normally distributed. If the points form an (almost) straight line, the variable is considered to be normally distributed (which is the case here). We ignore the Detrended Normal Q-Q Plot of LENGTH. The last plot that appears is the box-and-whisker plot, and it also looks good.
Explore can also be used on subgroups. Place SEX in the Factor List of the Explore dialogue. The results are now displayed clearly for men and women separately.
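As a rough illustration of how a box-and-whisker plot separates possible outliers from extreme values, the Python sketch below applies a common convention: values more than 1.5 times the IQR beyond the quartiles count as outliers, and values more than 3 times the IQR beyond the quartiles count as extremes. The thresholds, the quartile method, the heights and the function name box_plot_summary are all assumptions for illustration; SPSS may compute the box edges slightly differently.

```python
import statistics


def box_plot_summary(values, inner=1.5, outer=3.0):
    """Quartiles, IQR and values flagged as outliers or extremes (assumed fences)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_inner, high_inner = q1 - inner * iqr, q3 + inner * iqr
    low_outer, high_outer = q1 - outer * iqr, q3 + outer * iqr
    # Extremes lie beyond the outer fences.
    extremes = [v for v in values if v < low_outer or v > high_outer]
    # Outliers lie beyond the inner fences but not beyond the outer fences.
    outliers = [
        v for v in values
        if (low_outer <= v < low_inner) or (high_inner < v <= high_outer)
    ]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "outliers": outliers, "extremes": extremes,
    }


# Hypothetical body heights (cm); NOT taken from HOORN.SAV.
heights = [158, 162, 165, 167, 168, 170, 171, 173, 175, 178, 181, 212]
summary = box_plot_summary(heights)
print(summary)
```

A flagged value is only a prompt to go back to the source data, as the text describes; it is not by itself a reason to delete the case.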