Tutorial SPSS 1 descriptive statistics
Introduction
The purpose of this tutorial is to familiarize students with the necessary techniques for
statistical analysis. The program used is SPSS (for Windows).
The "Hoorn Study" will be introduced below. It will be used frequently as an example of
statistical analysis.
The Hoorn Study
The Hoorn Study will be used as an introductory example in the computer lab. The data file is
called HOORN.SAV. The study is an epidemiological screening for the prevention of diabetes
(diabetes mellitus) in the general population. It was conducted among 2500 residents of the
municipality of Hoorn aged 50 to 75 years. The representative sample used here is limited
to 250 participants. HOORN.SAV contains a selection of 25 variables from the General
Questionnaire, which asks about personal background information (gender, education, etc.),
medical history, symptoms related to diabetes, life events and lifestyle. In addition, the file
contains a selection of 13 variables with laboratory measurements for the same study
population. Of course, all sensitive information (name, address) has been removed from the
data. For most variables, "user missing values" have been defined. These are codes (usually
impossibly high values such as 9, 99 or 999) that indicate a missing value. SPSS must
obviously ignore these "missing values" in its calculations.
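To see why the missing-value codes must be left out of the calculations, the following minimal Python sketch compares a mean computed with and without them. The values and the choice of 999 as the missing code are invented for illustration and are not taken from HOORN.SAV; SPSS does this filtering itself once the codes are defined as user missing values.

```python
from statistics import mean

# Hypothetical raw values for one variable. 999 is ASSUMED to be the
# user-missing code for this variable (not taken from HOORN.SAV).
MISSING_CODES = {999}
raw_values = [172, 165, 999, 180, 158, 999, 169]

# Keep only the valid observations before doing any calculation.
valid_values = [v for v in raw_values if v not in MISSING_CODES]

naive_mean = mean(raw_values)    # wrongly treats 999 as a real measurement
valid_mean = mean(valid_values)  # missing codes are ignored

n_missing = len(raw_values) - len(valid_values)
print(f"valid n = {len(valid_values)}, missing n = {n_missing}")
print(f"mean including missing codes: {naive_mean:.1f}")
print(f"mean of valid values only:    {valid_mean:.1f}")
```

If the codes were not excluded, the two impossible values would inflate the mean far beyond any plausible measurement.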
Notation
Parts of the main menu and submenus, procedure names, buttons leading to sub-dialogues and
other controls such as Continue and OK are shown in bold with an initial capital and are
separated by a ";". Names of windows, boxes, radio buttons and items that have to be ticked
are in italics. This also applies to text that appears on the computer printout.
File names and variable names are shown in UPPERCASE.
Startup SPSS and Location of Files
Turn on the computer and start Windows. Double-click the SPSS icon on the desktop (if
present), or click the Start button at the bottom left of the task bar and select SPSS for
Windows via the option Programs. The title bar, menu bar, toolbar, the data editor (at this
point an empty spreadsheet) and the status bar should appear. At the bottom left of the
screen are two tabs: "data view" and "variable view". If the "data view" screen is active, a
spreadsheet is available for entering the data of the subjects. If the "variable view" screen is
active, variables can be defined (is the variable continuous or not, definition of missing
values, etc.). The content and appearance of the various toolbars and menus change
depending on which window is active. If a particular item is no longer available, check
whether the correct window is active.
The data needed for the tasks of this computer lab have already been entered and stored in
SPSS system files. System files are SPSS-specific files that contain not only all the data but
also their definition (labels, "missing value" definitions, etc.). The information in such a
system file is stored in a form that only SPSS can work with. All SPSS system files have the
extension "SAV", which makes them easily recognizable.
Syntax
When you carry out a task (e.g. calculating the mean and median of a variable), you can
have SPSS execute it directly, or you can write the command to a syntax file. A syntax file
contains the 'computer language' with which the command can be executed. When you write
a command to a syntax file and save that file, SPSS can run it again later without your
having to specify the command again. An additional advantage is that the steps of complex
analyses do not need to be written down or remembered exactly: the syntax documents the
analysis. It is therefore advisable always to write the command to a syntax file first and then
execute it from there. This is a simple procedure that will be described on page 4. A syntax
file can be recognized by the file extension 'SPS'.
SPSS Viewer
The results of the functions you run appear in the SPSS Viewer. Because this quickly
becomes a large and cluttered file, it is recommended either to close this window regularly
without saving, or to discard parts of the output regularly by selecting the relevant sections in
the left pane of the Viewer and deleting them. With the latter approach you keep the useful
examples of SPSS output, which you can then store in a file with a name of your choosing.
The extension of Viewer documents is 'SPO'. You can then copy that file from the hard disk
to your own disk using Windows Explorer. The same obviously applies to the SPSS system
files (SAV) and syntax files (SPS) used in this lab.
PRACTICAL LAB 1: DESCRIPTIVE STATISTICS AND DATA SCREENING
In this exercise, the functions for describing and summarizing nominal, ordinal and
continuous data are discussed. The descriptive statistics commands are found in the
submenu Analyze; Descriptive Statistics. The procedures Frequencies and Descriptives
will be addressed first. Next, attention is given to the procedure Means from
Analyze; Compare Means. Lastly, the procedure Explore, from the submenu Analyze;
Descriptive Statistics, will be covered.
FREQUENCIES FUNCTION
The Frequencies function is particularly suitable for describing and graphically displaying
nominal and ordinal variables (frequency tables and bar charts). The function can, however,
also be used for continuous variables. Measures of location (mean, median, mode) and
dispersion (variance, standard deviation of the data, minimum and maximum values, range
and percentiles) can be calculated using the sub-dialogue button Statistics of the
Frequencies dialogue box. Bar charts, pie charts and histograms can be made with the
sub-dialogue button Charts.
For the following exercise, we use three variables of different measurement levels from the
file 'HOORN.SAV':
BURGST: marital status (nominal)
OPLEID: education (ordinal)
SGLUCN: fasting glucose value (continuous)
First start SPSS (see Introduction) and open the file 'HOORN.SAV' through File; Open. The
files required for the practical are available on the intranet.
Choose Analyze; Descriptive Statistics; Frequencies. After choosing Frequencies a
dialogue box appears. In this window you specify the variables (in our case BURGST,
OPLEID and SGLUCN) for which a frequency distribution is to be made. (Note: depending on
the settings of SPSS, either the names or the labels of the variables appear in the dialogue
box. Labels, especially long ones, can be annoying. Change the settings via Edit; Options:
on the General tab, under Variable Lists, select Display names and Alphabetical. Now reload
the file (in a somewhat roundabout way) via File; New; Data (this opens a blank screen) and
File; Open; 'HOORN.SAV' (we are back in Hoorn). Then run Frequencies for SGLUCN
again.) Select the relevant variables by double-clicking on them, or by clicking once and
using the arrow to move them to the Variables box on the right. If necessary, tick Display
frequency tables (otherwise you will not get the desired frequency tables). Then click the OK
button. The results appear in the window Output1 of the SPSS Viewer. The Frequencies
output starts with the Statistics table, which shows for each variable the number of valid
values ("Valid") and the number of missing values ("Missing"). For 1 of the 250 participants
the marital status is apparently unknown, and for 3 participants the data on education are
missing.
Each variable is then printed in a clear table that shows, for each value, the number of
cases (Frequency), the percentage of cases calculated from the total number of cases
including the missing values (Percent), the percentage of cases calculated from the cases
with valid values only, so that missing values are disregarded (Valid Percent), and the
cumulative percentage of cases with valid values (Cumulative Percent).
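The difference between the three percentage columns is easiest to see in a small calculation. The Python sketch below rebuilds such a table by hand; the coded answers are invented for illustration (they are not the BURGST or OPLEID data), and None stands in for a missing value.

```python
from collections import Counter

# Hypothetical coded answers for a nominal variable; None marks a missing
# value. Codes and counts are invented, not taken from HOORN.SAV.
values = [1, 2, 2, 1, 3, 2, None, 1, 2, 2]

counts = Counter(v for v in values if v is not None)
n_total = len(values)
n_valid = sum(counts.values())

rows = []
cumulative = 0.0
for value in sorted(counts):
    frequency = counts[value]
    percent = 100 * frequency / n_total        # base: all cases, missing included
    valid_percent = 100 * frequency / n_valid  # base: valid cases only
    cumulative += valid_percent                # running total of Valid Percent
    rows.append((value, frequency, percent, valid_percent, cumulative))

print("Value  Frequency  Percent  Valid Percent  Cumulative Percent")
for value, frequency, percent, valid_percent, cum in rows:
    print(f"{value:>5}  {frequency:>9}  {percent:>7.1f}  {valid_percent:>13.1f}  {cum:>18.1f}")
print(f"Missing: {n_total - n_valid} of {n_total} cases")
```

Percent and Valid Percent coincide only when a variable has no missing values; the cumulative column always ends at 100 because it is based on the valid cases.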
If all is well, the label of each variable is shown, as well as the descriptions of the individual
values ("value labels"). If this is not the case, set this option as follows: from the Edit menu
click Options and then the Output Labels tab. Under Variable values in labels shown as,
choose the option Values and Labels, and click OK. From now on, the output of Frequencies
shows both the values and the labels of the variables.
The first two tables are easy to read. For the continuous variable SGLUCN, however, a great
deal of output is produced. A frequency table is not suitable for a continuous variable! If we
want to summarize this variable in a table, we had better first divide it into classes and then
create a frequency distribution of the classes.
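Dividing a continuous variable into classes before counting can be sketched as follows in Python. The glucose values and the class boundaries are assumptions chosen for illustration only; they are not taken from HOORN.SAV and are not a recommended classification.

```python
from collections import Counter

# Hypothetical fasting glucose values in mmol/l (invented for illustration).
glucose = [4.8, 5.1, 5.3, 5.6, 5.9, 6.2, 6.4, 7.1, 7.8, 9.5]

# ASSUMED class boundaries: each class runs from its lower bound up to, but
# not including, the next lower bound; the last class is open-ended.
lower_bounds = [4.0, 5.0, 6.0, 7.0]

def class_of(value):
    """Return the lower bound of the class that contains value.

    Values below the first lower bound are not handled in this sketch.
    """
    return max(b for b in lower_bounds if value >= b)

class_counts = Counter(class_of(v) for v in glucose)

for bound in lower_bounds:
    print(f"class starting at {bound:.1f}: {class_counts[bound]} cases")
```

The resulting table has four rows instead of one row per distinct measurement, which is what makes it readable.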
We will now calculate some statistical measures. Open the Frequencies dialogue again
using Analyze; Descriptive Statistics; Frequencies. Click Statistics. You are now in
the dialogue Frequencies: Statistics. Most options are self-explanatory. With the option
Percentile(s), any desired percentile can be calculated. If, for example, you want to calculate
the tenth and ninetieth percentiles, tick the option Percentile(s). Then type "10" and click
Add (10 now appears in the box), then type "90" and click Add (90 now appears in the
box). Quartiles, moreover, are available as a standard option in the dialogue box
("Quartiles"). After selecting the other desired statistical measures, click the Continue button
and then the OK button. Interpret the results. Note the measurement level of the variables!
How useful are these statistical measures for nominal and ordinal variables?
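As a rough illustration of what the Percentile(s) and Quartiles options compute, the Python sketch below derives the 10th and 90th percentiles and the quartiles of a small invented data set with the standard library. Note the assumption: several interpolation rules for percentiles exist, and the rule used by Python's `statistics.quantiles` is not guaranteed to match the one SPSS applies, so small differences from SPSS output are possible.

```python
from statistics import quantiles

# Hypothetical continuous measurements (invented, not from HOORN.SAV).
data = [4.6, 4.9, 5.0, 5.2, 5.3, 5.5, 5.8, 6.1, 6.7, 7.4, 8.9]

# quantiles(..., n=100) returns the 99 cut points P1 ... P99,
# so P10 is at index 9 and P90 at index 89.
cut_points = quantiles(data, n=100)
p10 = cut_points[9]
p90 = cut_points[89]

# n=4 returns the three quartiles; the second quartile is the median.
q1, median, q3 = quantiles(data, n=4)

print(f"P10 = {p10:.2f}, P90 = {p90:.2f}")
print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}")
```

Percentiles require at least an ordinal scale, which is one way to approach the question of how useful these measures are for nominal variables.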
We can also display the frequency distributions of these variables graphically. For nominal
and ordinal variables we create a bar chart ("Bar chart") or a pie chart ("Pie chart"), and for
continuous variables a histogram. For the latter, SPSS determines the class division itself.
We again build these commands via Analyze; Descriptive Statistics; Frequencies.
Consider for which variables a bar chart, pie chart or histogram is appropriate. In the
Frequencies dialogue click Charts and, in the subsequent dialogue box, select Bar charts
(or Pie charts) or Histograms. In the latter case, the option With normal curve also draws a
normal distribution over the histogram. This normal curve is constructed on the basis of the
mean and standard deviation of the data. For continuous variables, also suppress the
frequency table by unticking the option Display frequency tables in the Frequencies dialogue
box. (Note: Frequencies therefore has to be run several times.)
DESCRIPTIVES FUNCTION
The Descriptives function is primarily intended for continuous variables. This function
calculates the most important statistical measures, but the options are not as extensive as
in the Frequencies function. The output is more compact. No frequency tables are produced
and the results for multiple variables are arranged one below the other in a single table. By
default, the mean, standard deviation, minimum and maximum of the data are calculated.
Click on the Options button to select further statistical measures. By selecting Save
standardized values as new variables it is also possible to calculate Z-scores for the
specified variables (the difference between each observation and the mean of all
observations, divided by the standard deviation of the observations). These Z-scores are
added to the existing variables as new variables. SPSS chooses the names for these new
variables. For the variables LENGTH, WEIGHT and SBLDSYS1 (systolic blood pressure at
the first measurement) from the file 'HOORN.SAV', determine the mean, the standard
deviation, the variance, the standard error of the mean ("standard error", SE) and the
minimum and maximum values.
You will notice that a mistake was made when the data for the variable SBLDSYS1 were
entered. What happened? Correct the error and repeat the procedure.
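The measures requested in this exercise, and the Z-scores described above, can be written out in a few lines of Python. This is only a sketch on invented height values (not the LENGTH data); it assumes the sample standard deviation with divisor n - 1 and the standard error computed as the standard deviation divided by the square root of n.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical heights in cm (invented, not from HOORN.SAV).
heights = [158, 162, 165, 168, 170, 172, 175, 181]

n = len(heights)
m = mean(heights)
sd = stdev(heights)      # sample standard deviation (divisor n - 1)
var = variance(heights)  # sample variance, equal to sd squared
se = sd / sqrt(n)        # standard error of the mean

# Z-score: distance of each observation from the mean, in standard deviations.
z_scores = [(x - m) / sd for x in heights]

print(f"n = {n}, mean = {m:.2f}, sd = {sd:.2f}, variance = {var:.2f}, SE = {se:.2f}")
print(f"minimum = {min(heights)}, maximum = {max(heights)}")
print("z-scores:", [round(z, 2) for z in z_scores])
```

By construction the Z-scores have mean 0 and standard deviation 1. The minimum and maximum are worth a glance in any such summary, because an implausible extreme is often the first sign of a data entry error.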
MEANS FUNCTION
The Means function (only for continuous variables) can calculate a number of descriptive
statistical measures of a variable, but separately for one or more subgroups, such as men
and women. GENDER is then the variable that indicates the grouping (the Independent
variable).
We can calculate some statistical measures of fasting blood glucose (SGLUCN) for the
different treatment categories of diabetes (DMBEHAND). Open the Means dialogue via
Analyze; Compare Means; Means. Put SGLUCN into the Dependent List box and
DMBEHAND into the Independent List box. By default, for all variables in the Dependent List
(now only SGLUCN) the mean, the number of valid cases and the standard deviation of the
observations are calculated for all values of the variables specified in the Independent List
(here only DMBEHAND). By clicking on the Options button, many other statistical measures
can be selected. Select, for example, the median and the minimum and maximum. After
pressing the Continue and OK buttons, a clear output appears in the SPSS Viewer.
Finally, calculate the summary statistical measures for the variable SBLDSYS1 (systolic
blood pressure at screening, 1st measurement) by gender. Do the same for a dichotomy of
the variable weight (< 80 and >= 80 kg). To do so, first recode the variable WEIGHT (using
Transform; Recode; Into Different Variables).
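The combination of recoding into a dichotomy and then summarizing per subgroup can be sketched in Python as follows. The weight and blood pressure pairs are invented for illustration, and the group codes 1 and 2 are an arbitrary assumption; in SPSS the recoded variable would be created with the Recode dialogue and then used in the Independent List.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical cases as (weight in kg, systolic blood pressure in mmHg).
# Invented values, not taken from HOORN.SAV.
cases = [(62, 128), (71, 135), (79, 140), (80, 138),
         (85, 150), (93, 155), (77, 131), (88, 147)]

def weight_class(weight_kg):
    """Dichotomise weight: 1 = below 80 kg, 2 = 80 kg or more (assumed codes)."""
    return 1 if weight_kg < 80 else 2

# Collect the dependent variable per value of the grouping variable.
groups = defaultdict(list)
for weight_kg, systolic in cases:
    groups[weight_class(weight_kg)].append(systolic)

summary = {}
for code in sorted(groups):
    vals = groups[code]
    summary[code] = {
        "n": len(vals), "mean": mean(vals), "sd": stdev(vals),
        "median": median(vals), "min": min(vals), "max": max(vals),
    }
    s = summary[code]
    print(f"group {code}: n={s['n']} mean={s['mean']:.1f} sd={s['sd']:.1f} "
          f"median={s['median']:.1f} min={s['min']} max={s['max']}")
```

Note that a weight of exactly 80 kg falls in the second group, matching the ">= 80" side of the dichotomy.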
An alternative approach to calculations for subgroups is to use the Select Cases dialogue
via Data; Select Cases. Select each subgroup in turn by using 'If condition is satisfied'. For
example, first select the men (SEX = 1) and run the Frequencies function with all required
(statistical and graphical) options. Do not forget to suppress the frequency table. Repeat this
procedure after selecting the women (SEX = 2). The advantage of these additional
operations is that the graphics capabilities (histogram) of the Frequencies procedure can
now be used. Another alternative: Data; Split File on SEX (followed by Frequencies).
EXPLORE FUNCTION
In order to determine whether the (continuous) data from your sample are normally
distributed, you can make use of the already familiar functions Frequencies and
Descriptives. The Explore function, however, combines all the options needed for this. It is
intended for continuous variables and calculates the required (descriptive) statistical
measures of location and dispersion. In addition, Explore produces plots in various layouts:
histograms, box-and-whisker plots, stem-and-leaf plots and normal probability plots. The
data can also be tested for Normality.
Reopen the file 'HOORN.SAV'. We will now continue with the variable LENGTH. Open the
Explore dialogue using Analyze; Descriptive Statistics; Explore. Place LENGTH in the
Dependent List box and run the command (OK). Two tables and two graphs appear in the
SPSS Viewer. The first table again contains information on the number of valid cases and
the number of cases with missing values. The second table contains statistical measures of
location and spread, including the median and the interquartile range (IQR). The mean and
the median differ little, and the standard deviation of the observations is small. In this case
(LENGTH takes only positive values) these results look good with regard to the Normality of
LENGTH. The first graph is a stem-and-leaf plot, which looks normal. The last graph,
produced by default, is a box-and-whisker plot. This also looks good (median in the middle
of the box, symmetric whiskers). In addition, in this last plot Explore prints the case numbers
of cases that are possible outliers and/or rather extreme values. If there were such cases in
our example, the questionnaire would be checked to find out where the extreme values
come from.
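How a box-and-whisker plot arrives at its flagged cases can be sketched in Python. The sketch assumes the common convention that values more than 1.5 box lengths (IQR) outside the box are outliers and values more than 3 box lengths outside it are extreme values. The heights are invented, and the quartile rule of Python's `statistics.quantiles` may differ slightly from the one SPSS uses for the box edges.

```python
from statistics import quantiles

# Hypothetical heights in cm, in case order (invented, not from HOORN.SAV).
heights = [150, 160, 162, 164, 166, 168, 170, 172, 174, 176, 199]

q1, med, q3 = quantiles(heights, n=4)
iqr = q3 - q1  # interquartile range = length of the box

def classify(value):
    """Label a value by its distance from the box, measured in box lengths."""
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "outlier"
    return "ok"

# Report flagged cases by their 1-based case number, as a box plot would.
flagged = [(case_no, value, classify(value))
           for case_no, value in enumerate(heights, start=1)
           if classify(value) != "ok"]

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
for case_no, value, label in flagged:
    print(f"case {case_no}: value {value} is flagged as {label}")
```

Flagged cases are candidates for checking against the original questionnaire, not values to be deleted automatically.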
Repeat the exercise, but now click on the Plots button in the Explore dialogue box. Put
check marks next to the items Histogram and Normality plots with tests and end the
command sequence with Continue and OK. After the 2 tables already described, a third
table appears with the results of the Kolmogorov-Smirnov test for Normality. If the number
of cases is 50 or fewer (which is not the case here), the "Shapiro-Wilk" test also appears in
the output. There are quite a few reservations about these tests (given our 250 cases), so
ignore these results. For those who are uncomfortable with this: the p-value of the test is
0.200, so we do not reject the null hypothesis "LENGTH is normally distributed". More
importantly, to assess the normality of the data we use visual inspection. For this we have
4 charts available (we ignore the fifth). The histogram looks good, as does the stem-and-leaf
plot, of course. The next chart is a normal plot (Normal Q-Q Plot of LENGTH). This plot is
an excellent tool for deciding whether or not the variable is normally distributed. If the points
form an (almost) straight line, the variable is considered to be normally distributed (which is
the case here). We ignore the Detrended Normal Q-Q Plot of LENGTH. The last plot that
appears is the box-and-whisker plot, and it also looks good.
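The idea behind the normal Q-Q plot can be illustrated without any plotting: sort the observations, compute the values a normal distribution with the same mean and standard deviation would be expected to produce at the same ranks, and check how close the pairs lie to a straight line. The Python sketch below does this for invented heights. The plotting positions (i - 0.5) / n are an assumption (several conventions exist, and SPSS may use another), and the correlation coefficient is only a numeric stand-in for the visual judgement.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical heights in cm (invented, not from HOORN.SAV), sorted ascending.
observed = sorted([158, 162, 165, 167, 168, 170, 171, 173, 176, 180])
n = len(observed)
reference = NormalDist(mean(observed), stdev(observed))

# ASSUMED plotting positions (i - 0.5) / n for the i-th smallest observation.
expected = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Pearson correlation: close to 1 when the points lie near a straight line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(observed, expected)

for obs, exp in zip(observed, expected):
    print(f"observed {obs:>5}   expected under normality {exp:7.1f}")
print(f"correlation between observed and expected values: {r:.3f}")
```

Plotting the observed values against the expected ones gives the Q-Q plot itself; systematic curvature in that plot, rather than a single stray point, is what suggests a departure from normality.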
Explore can also be used on subgroups. Place SEX in the Factor List box of the Explore
dialogue. The results are now displayed separately for men and women.