BIOMETRICS I Description, Visualisation and Simple Statistical Tests Applied to Medical Data Lecture notes Harald Heinzl and Georg Heinze Core Unit for Medical Statistics and Informatics Medical University of Vienna Version 2010-07 Contents Chapter 1 Data collection ...............................................................4 1.1. Introduction ............................................................................ 4 1.2. Data collection ........................................................................ 6 1.3. Simple analyses..................................................................... 14 1.4. Aggregated data .................................................................... 16 1.5. Exercises .............................................................................. 18 Chapter 2 Statistics and graphs ...................................................19 2.1. Overview .............................................................................. 19 2.2. Graphs ................................................................................. 20 2.3. Describing the distribution of nominal variables ......................... 21 2.4. Describing the distribution of ordinal variables........................... 32 2.5. Describing the distribution of scale variables ............................. 33 2.6. Outliers ................................................................................ 57 2.7. Missing values ....................................................................... 63 2.8. Further graphs ...................................................................... 68 2.9. Exercises .............................................................................. 70 Chapter 3 Probability ...................................................................72 3.1. Introduction .......................................................................... 72 3.2. Probability theory .................................................................. 75 3.3. Exercises .............................................................................. 95 Chapter 4 Statistical Tests I .........................................................98 4.1. Principle of statistical tests ...................................................... 98 4.2. t-test ................................................................................. 102 4.3. Wilcoxon rank-sum test ........................................................ 107 4.4. Excercises ........................................................................... 112 Chapter 5 Statistical Tests II .....................................................113 5.1. More about independent samples t-test .................................. 113 5.2. Chi-Square Test................................................................... 119 5.3. Paired Tests ........................................................................ 124 5.4. Confidence intervals ............................................................. 133 5.5. One-sided versus two-sided tests .......................................... 135 5.5 Exercises ............................................................................. 137 Appendices.................................................................................142 A. Opening and importing data files ............................................ 142 B. Data management with SPSS ................................................. 154 C. Restructuring a longitdudinal data set with SPSS ..................... 167 D. Measuring agreement ......................................................... 178 E. Reference values .................................................................. 189 F. SPSS-Syntax ........................................................................ 195 G. Exact tests ........................................................................ 201 H. Equivalence trials ............................................................... 208 I. Describing statistical methods for medical publications .............. 210 J. Dictionary: English-German ................................................... 211 References .................................................................................215 2 Preface These lecture notes are intended for the PGMW course “Biometrie I”. This manuscript is the first part of the lecture notes of the “Medical Biostatistics 1” course for PhD students of the Medical University of Vienna. The lecture notes are based on material previously used in the seminars „Biometrische Software I: Beschreibung und Visualisierung von medizinischen Daten“ and „Biometrische Software II: Statistische Tests bei medizinischen Fragestellungen“. Statistical computations are based on SPSS 17.0 for Windows, Version 17.0.1 (1st Dec 2008, Copyright © SPSS Inc., 1993-2007). The data sets used in the lecture notes can be downloaded at http://www.muw.ac.at/msi/biometrie/lehre Chapters 1 and 2, and Appendices A-E have been written by Georg Heinze. Harald Heinzl is the author of chapters 3-6 and Appendices F-I. Martina Mittlböck translated chapters 4 and 5 and Appendices G-I. Andreas Gleiß translated Appendices F, J and K. Georg Heinze translated chapters 1-3, 6 and Appendices A-E. Sincere thanks are given to the translators, particularly to Andreas Gleiß who assembled all pieces into one document. Version 2009-03 contains fewer typing errors, mistranslations and wrong citations than previous versions due to the efforts of Andreas Gleiß. The contents of the lecture notes have been revised by Harald Heinzl (Version 2008-10). Version 2009-03 is updated to SPSS version 17 by Daniela Dunkler, Martina Mittlböck and Andreas Gleiß. Screenshots and SPSS output which have changed only in minor aspects have not been replaced. Note that older versions of SPSS save output files with .SPO extension, while SPSS 17 uses an .SPV extension. Old output files cannot be viewed in SPSS 17. For this purpose the SPSS Smart Viewer 15 has to be used which is delivered together with SPSS 17. Further note that the language of the SPSS 17 user interface and of the SPSS 17 output can be changed independently from each other in the options menu. If you find any errors or inconsistencies, or if you come across statistical terms that are worth including in the German-English dictionary (Appendix I), then you will be asked to notify us via e-mail: harald.heinzl@meduniwien.ac.at georg.heinze@meduniwien.ac.at 3 Chapter 1 Data collection 1.1. Introduction Statistics 1 can be classified into descriptive and inferential statistics. Descriptive statistics is a toolbox useful to characterize the properties of members of a sample. The tools of this toolbox comprise • • Statistics1 (mean, standard deviation, median, etc.) and Graphs (boxplot, scatterplot, pie chart, etc.) By contrast, inferential statistics provides mathematical techniques that help us in drawing conclusions about the properties of a population. These conclusions are usually based on a subset of the population of interest, called the sample. The motivation of any medical study should be a meaningful scientific question. The purpose of any study is to give an answer to a scientific question and not just searching a data base in order to find any significant associations. The scientific question always relates to a particular population. A sample is a randomly drawn subset of that population. An important requirement often imposed on a sample is that it is representative for a population. This requirement is fulfilled if • • Each individual of the population has equal probability to be selected, i.e., the selection process is independent from the scientific question Each individual is independent from each other, i. e., the selection of individual a has no influence on the probability of selection of individual b Example 1.1.1: Hip joint endoprothesis study. Consider a study which should answer the question how long hip joint endoprostheses can be used. The population related to this scientific question consists of all patients who will receive a hip joint endoprosthesis in the future. Clearly, this population comprises a potentially unlimited number of individuals. Consider all patients who received a hip joint endoprosthesis during the years 1990 – 1995 in the Vienna General Hospital. These patients will be followed for 15 years or until 1 Note that the word STATISTICS has different meanings. On this single page it is used to denote both the entire scientific field and rather simple computational formulas. Besides them, there can be other meanings as well. 7B1.1. Introduction their death on a regular basis. They constitute a representative sample of the population. An example of a sample which is not suitable to draw conclusions about the properties of the populations is the set of all patients which were scheduled for a follow-up examination in the year 2000. This sample is not representative because we miss any patients who died or underwent a revision up to that year and the results would be over-optimistic. A sample always consists of Observations on individuals (e. g., patients) Properties or variables which were observed (e. g., systolic blood pressure before and after five minutes of training, sex, body-massindex, etc.) and which vary in the individuals • • A sample can be represented in a table, which is often called data matrix. In a data matrix, rows usually correspond to observations and columns to variables. Example 1.1.2: Data matrix. Representation of the variables patient number, sex, age, weight and height of four patients: Pat. No. Sex 1 2 3 4 M F F M Age (years) 35 36 37 30 5 Weight (kg) 80 55 61 72 Height (cm) 185 167 173 180 8B1.2. Data collection 1.2. Data collection Case report forms Data are usually collected on paper using so-called case report forms (CRFs). On these, all study-relevant data of a patient are recorded. Case report forms should be designed such that three principles are observed: • • • Unambiguousness: the exact format of data values should be given, e. g. YYYYMMDD for date variables Clearness: the case report forms should be easy to use Parsimony: only required data should be recorded The last principle is very important. If case report forms are too long, then the motivation of the person recording the data will decrease and data quality will be negatively affected. Building a data base After data collection on paper, the next step is data entry in a computer system. While small data sets can be easily transferred directly to a data matrix on the screen, one can make use of computer forms to enter large amounts of data. These computer forms are the analogue of case report forms on a computer screen. Data are typed into the various fields of the form. Commercial programs allowing the design and use of forms are, e.g., Microsoft Office Access or SAS. Epi Info™ is a freeware program (http://www.cdc.gov/epiinfo/, cf. Fig. 1) using the format of Access to save data, but with a graphical user interface that is easier to handle than that of Access. 6 8B1.2. Data collection Fig.1: computer form prepared with Epi Info™ 3.5 After data entry using forms, the data values are saved in data matrices. One row of the data matrix corresponds to one form, and each column of a row corresponds to one field of a form. Electronic data bases can usually be converted from one program to another, e. g., from a data entry program to a statistical software system like SPSS, which will be used throughout the lecture. SPSS also offers a commercial program for the design of case report forms and data entry (“SPSS Data Entry”). When building a data base, no matter which program is used, the first step is to decide which variables it should consist of. In a second step we must define the properties of each variable. The following rules apply: • The first variable should contain a unique patient identification number, which is also recorded on the case report forms. 7 8B1.2. Data collection • Each property that may vary in individuals can be considered as a variable. • With repeated measurements on individuals (e.g., before and after treatment) there are several alternatives: o Wide format: One row (form) per individual; repeated measurements on the same property are recorded in multiple columns (or fields on the form); e. g., PAT=patient identification number, VAS1, VAS2, VAS3 = repeated measurements on VAS o Long format: One row per individual and measurement; repeated measurements on the same property are recorded in multiple rows of the same column, using a separate column to define the time of measurement; e.g., PAT=patient identification number, TIME=time of measurement, VAS=value of VAS at the time of measurement If for the first alternative the number of columns (fields) becomes too large such that computer forms become too complex, the second alternative will be chosen. Note that we can always restructure data from the wide to the long format and vice versa. Building an SPSS data file The statistical software package SPSS appears in several windows: The data editor is used to build data bases, to enter data by hand, to import data from other programs, to edit data, and to perform interactive data analyses. The viewer collects results of data analyses. The chart editor facilitates the modification of diagrams prepared by SPSS and allows identifying individuals on a chart. Using the syntax editor, commands can be entered and collected in program scripts, which facilitates non-interactive automated data analysis. 8 8B1.2. Data collection Data files are represented in the data editor, which consists of two tables (views). The data view shows the data matrix with rows and columns corresponding to individuals and variables, respectively. The variable view contains the properties of the variables. It can be used to define new variables or to modify properties of existing variables. These properties are: • Name: unique alphanumeric name of a variable, starting with an alphabetic character. The name may consist of alphabetic and numeric characters and the underscore (“_”). Neither spaces nor special characters are allowed. • Width: the maximum number of places. • Decimals: the number of decimal places. • Label: a description of the variable. The label will be used to name variables in all menus and in the output viewer. • Values: labels assigned to values of - nominal or ordinal - variables. The value labels replace values in any category listings. Example: value 1 corresponds to value label ‘male’, value 2 to label ‘female’. Data are entered as 1 and 2, but SPSS displays ‘male’ and ‘female’. Using the button one can switch between the display of the values and the value labels in the data view. Within the value label view, one can directly choose from the defined value labels when entering data. 9 8B1.2. Data collection • Missing: particular values may be defined as missing values. These values are not used in any analyses. Usually, missing data values are represented as empty fields and if a data value is missing for an individual, it is left empty in the data matrix. • Columns: defines the width of the data matrix column of a variable as number of characters. This value is only relevant for display and changes if one column is broadened using the mouse. • Align: defines the alignment of the data matrix column of a variable (left/right/center). • Measure: nominal, ordinal or scale. The applicability of particular statistical operations on a variable (e. g., computing the mean) depends on the measure of the variable. The measure of a variable is called o nominal if each observation belongs to one of a set of categories, and it is called o ordinal if these categories have a natural order. SPSS calls the measure of a variable o ‘scale’ if observations are numerical values that represent different magnitudes of the variable. Usually (outside SPSS), such variables are called `metric’, ‘continuous’ or ‘quantitative’, compared to the ‘qualitative’ and ‘semi-quantitative’ nature of nominal and ordinal variables, respectively. Examples for nominal variables are sex, type of operation, or type of treatment. Examples for ordinal variables are treatment dose, tumor stage, response rating, etc. Scale variables are, e. g., height, weight, blood pressure, age. • Type: the format in which data values are stored. The most important are the numeric, string, and date formats. o Nominal and ordinal variables: choose the numeric type. Categories should be coded as 1, 2, 3, … or 0, 1, 2, … Value labels should be used to paraphrase the codes. o Scale variables: choose the numeric type, pay attention to the correct number of decimal places, which applies to all computed statistics (e. g., if a variable is defined with 2 decimal places, and you compute the mean of that variable, then 2 decimal places will be shown). However, the computational accuracy is not affected by this option. 10 8B1.2. Data collection o Date variables: choose the date type; otherwise, SPSS won’t be able to compute the length of time passing between two dates correctly. The string format should only be used for text that will not be analyzed statistically (e. g., addresses, remarks). For nominal or ordinal variables, the numeric type should be used throughout as it requires an exact definition of categories. The list of possible categories can be extended while entering data, and category codes can be recoded after data entry. Example 1.2.1: Consider the variable “location” and the possible outcome category “pancreas” and “stomach”. How should this variable be defined? Proper definition: The variable is defined properly, if these categories are given two numeric codes, 1 and 2, say. Value labels paraphrase these codes: In any output produced by SPSS, these value labels will be used instead of the numeric codes. When entering data, the user may choose between any of the predefined outcome categories: 11 8B1.2. Data collection Improper definition: The variable is defined with string type of length 10. The user enters alphabetical characters instead of choosing from a list (or entering numbers). This may easily lead to various versions of the same category: All entries in the column “location” will be treated as separate categories. Thus the program works with six different categories instead of two. Further remarks applying to data entry with any program: • Numerical variables should only contain numbers and no units of measurements (e.g. “kg”, “mm Hg”, “points”) or other alphabetical 12 8B1.2. Data collection or special characters. This is of special importance if a spreadsheet program like Microsoft Excel is used for data entry. Unlike real data base programs, spreadsheet programs allow the user to enter any type of data in any cell, so the user solely takes responsibility over the entries. • “True” missing values should be left empty rather than using special codes for them (e.g. -999, -998, -997). If special codes are used, they must be defined as missing value codes and they should be defined as value lables as well. Special codes can be advantageous for “temporal” missing values (e.g. -999=”ask patient”, -998=”ask nurse”, -997=”check CRF”). • A missing value means that the value is missing. By constrast, in Microsoft® Office Excel® an empty cell is sometimes interpreted as zero. • Imprecise values can be characterized by adding a column showing the degree of certainty that is associated with such values (e. g., 1=exact value, 0=imprecise value). This allows the analyst to drive two analyses: one with exact values only, and one using also imprecise values. By no way should imprecisely collected data values be tagged by a question mark! This will turn the column into a string format, and SPSS (or any other statistics program) will not be able to use it for analyses. • Enter numbers without using separators (e. g., enter 1000 as 1000 not as 1,000). If in a data base or statistics program a variable is defined as numeric then it is not possible to enter something else than numbers! Therefore, programs that do not distinguish variable types are error-prone (e. g., Excel). For a more sophisticated discussion about data management issues the reader is referred to Appendices A and B. 13 9B1.3. Simple analyses 1.3. Simple analyses In small data sets values can be checked by computing frequencies for a variable. This can be done using the menu Analyze-Descriptive Statistics-Frequencies. Put all variables into the field Variables and press OK. The SPSS Viewer window pops up and shows a frequency table for each variable. Within each frequency table, values are sorted in ascending order. This enables the user to quickly check the minimum and maximum values and discover implausible values. lie_3 Valid 120 Frequency 1 Percent 6.7 Valid Percent 6.7 Cumulative Percent 6.7 125 1 6.7 6.7 13.3 127 1 6.7 6.7 20.0 136 1 6.7 6.7 26.7 143 1 6.7 6.7 33.3 144 1 6.7 6.7 40.0 145 2 13.3 13.3 53.3 150 3 20.0 20.0 73.3 152 1 6.7 6.7 80.0 155 1 6.7 6.7 86.7 165 1 6.7 6.7 93.3 203 1 6.7 6.7 100.0 Total 15 100.0 100.0 The columns have the following meanings: • • • • Frequency: the absolute frequency of observations having the value shown in the first column Percent: percentage of observations with values equal to the value in the first column relative to total sample, including observations with missing values Valid percent: percentage of observations with values equal to the value in the first column relative to total sample, excluding observations with missing values Cumulative percent: percentage of observations with values up to the value shown in the first column. E. g., in line 145 a cumulative percentage of 53.3 means that 53.3% of the probands have blood pressure values less than or equal to 145. Cumulative percents refer to valid percents, that is, they exclude missing values. 14 9B1.3. Simple analyses Frequency tables are particulary useful for describing the distribution of nominal or ordinal variables: lie_3c Valid Normal Frequency Percent Valid Percent Cumulative Percent 11 73.3 73.3 High 3 20.0 20.0 93.3 Very high 1 6.7 6.7 100.0 15 100.0 100.0 Total 73.3 Obviously, variable “lie_3c” is ordinal. From the table we learn that 93.3% of the probands have normal or high blood pressure. 15 10B1.4. Aggregated data 1.4. Aggregated data SPSS is able to handle data that are already aggregated, i. e. data sets that have already been compiled to a frequency table. In the data set shown below, each observation corresponds to a category constituted by a unique combination of variable values. The variable “frequency” shows how many observations fall in each category: As we see, “frequency” is not a variable defining some property of the patients, it rather acts as a counter. Therefore, we must inform SPSS about the special meaning of “frequency”. This is done by choosing DataWeight cases from the menu and putting Number of patients (frequency) into the field Frequency Variable:. 16 10B1.4. Aggregated data Producing a frequency table for the variable “ageclass” we obtain: Age class Valid <40 40-60 Frequency 111 Percent 45.5 Valid Percent 45.5 Cumulative Percent 45.5 23 9.4 9.4 54.9 >60 110 45.1 45.1 100.0 Total 244 100.0 100.0 If the weighting had not been performed, the table would erroneously read as follows: Age class Valid Cumulative Percent 25.0 Frequency 3 Percent 25.0 Valid Percent 25.0 40-60 4 33.3 33.3 58.3 >60 5 41.7 41.7 100.0 Total 12 100.0 100.0 <40 This table just counts the number of rows with the corresponding ageclass values! 17 11B1.5. Exercises 1.5. Exercises 1.5.1. Cornea: Source: Guggenmoos-Holzmann and Wernecke (1995). In an ophthalmology department the effect of age on cornea temperature was investigated. Forty-three patients in four age groups were measured: Age group Measurements 12-29 35.0 34.1 33.4 35.2 35.3 34.2 34.6 35.7 34.9 30-42 34.5 34.4 35.5 34.7 34.6 34.9 34.6 34.9 33.0 34.1 33.9 34.5 43-55 35.0 33.1 33.6 33.6 34.2 34.5 34.3 32.5 33.2 33.2 56-73 34.5 34.7 35.0 34.1 33.8 34.0 34.3 34.9 34.5 34.5 33.4 34.2 Create an SPSS data file. Save the file as “cornea.sav”. 1.5.2. Psychosis and type of constitution Source: Lorenz (1988). 8099 patients suffering from endogenous psychosis or epilepsy were classified into five groups according to their type of constitution (asthenic, pyknic, athletic, dysplastic, and atypical) and into three classes according to type of psychosis. Each patient falls into one of the 15 resulting categories. The frequency of each category is depicted below: schizophrenia manicdepressive disorder epilepsy 2632 261 378 pyknic 717 879 83 athletic 884 91 435 dysplastic 550 15 444 atypical 450 115 165 asthenic Which are the variables of this data set? Enter the data set into SPSS and save it as “psychosis.sav”. 18 12B2.1. Overview Chapter 2 Statistics and graphs 2.1. Overview Typically, a descriptive statistical analysis of a sample takes two steps: Step 1: The data is explored, graphically and by means of statistical measures. The purpose of this step is to obtain an overview of distributions and associations in the data set. Thereby we don’t restrict ourselves to the main scientific question. Step 2: For describing the sample (e. g., for a talk or a paper) only those graphs and measures will be used, that allow the most concise conclusion about the data distribution. Unnecessary and redundant measures will be omitted. The statistical measures used for description are usually summarized in a table (e. g., a “patient characteristics” table). When choosing appropriate statistical measures (“statistics”) and graphs, one has to consider the measurement type (in SPSS denoted by “measure”) of the variables (nominal, ordinal, or scale). The following statistics and graphs are available: • Nominal variables (e. g. sex or type of operation) o Statistics: frequency, percentage o Graphs: bar chart, pie chart • Ordinal variables (e. g. tumor stage, school grades, categorized scales) o Statistics: frequency, percentage, median, quartiles, percentiles, minimum, maximum o Graphs: bar chart, pie chart, and – with reservations - box plot • Scale variables (e. g. height, weight, age, leukocyte count) o Statistics: median, quartiles, percentiles, minimum, maximum, mean, standard deviation, if data is categorized: frequency and percentage o Graphs: box plot, histogram, error bar plot, dot plot, bar chart and pie chart if data is categorized 19 13B2.2. Graphs The statistical description of distributions is shown on a data set called “cholesterol.sav”. This data set contains the variables age, sex, height, weight, cholesterol level, type of occupation, sports abdominal girth and hip dimension of 83 healthy probands. First, the total sample is described, then the description is grouped by sex. 2.2. Graphs SPSS distinguishes between graphs using the Chart Builder... and older versions of (interactive and non-interactive) graphs using Legacy Dialogs. Although the possibilities of these types of graphs overlap, there are some diagrams that can only be achieved by one or the other way. Most diagrams needed in the course can be done using the Chart Builder. It is important to note that the chart preview window within the Chart Builder dialogue window does not represent the data but only gives a scetch of a typical chart of the selected type. Further note that graphs which had been constructed using the menu Graphs - Legacy Dialogs – Interactive cannot be changed interactively in the same manner as in version 14, but are also edited using the chart editor. 20 14B2.3. Describing the distribution of nominal variables 2.3. Describing the distribution of nominal variables Sex and type of occupation are the nominal variables in our data set. A pie chart showing the distribution of type of occupation is created by choosing Graphs-Chart Builder... from the menu and dragging Pie/Polar from the Gallery tab to the Chart Preview field: Drag Type of occupation [occupation] into the field Slice by? and press OK. The pie chart is shown in the SPSS Viewer window: 21 14B2.3. Describing the distribution of nominal variables Double-clicking the chart opens the Chart Editor which allows to change colors, styles, labels etc. E. g., the counts and percentages shown in the chart above can be added by selecting Show Data Labels in the Elements menu of the Chart Editor. In the Properties window which pops up, Count and Percentages are moved to the Labels Displayed field using the green arrow button. For the Label Position we select labels to be shown outside the pie and finally press Apply. 22 14B2.3. Describing the distribution of nominal variables Statistics can be summarized in a table using the menu Analyze-TablesCustom Tables.... Drag Type of occupation to the Rows part in the preview field and press Summary Statistics: 23 14B2.3. Describing the distribution of nominal variables Here we select Column N % and move it to the display field using the arrow button. After clicking Apply to Selection and OK we arrive at the following table: Type of occupation 24 Col % 28.9% mostly standing 37 44.6% mostly in motion 16 19.3% 6 7.2% mostly sitting retired Count The table may be sorted by cell count by pressing the Categories and Totals... button in the custom tables window, which is sometimes useful for nominal variables. With ordinal or scale variables, however, the order of the value labels is crucial and has to be maintained. 24 14B2.3. Describing the distribution of nominal variables We can repeat the analyses building subgroups by sex. In the Chart Builder, check Columns panel variable in the Groups/Point ID tab and then drag Sex into the appearing field Panel?: 25 14B2.3. Describing the distribution of nominal variables The same information is provided by a grouped bar chart. It is produced by pressing Reset in the Chart Builder and selecting Bar in the Gallery tab and dragging the third bar chart variant (Stacked Bar) to the preview field: 26 14B2.3. Describing the distribution of nominal variables Drag Type of Occupation to the X-Axis ? and Sex to the Stack: set color field: 27 14B2.3. Describing the distribution of nominal variables 28 14B2.3. Describing the distribution of nominal variables Sometimes it is useful to compare percentages within groups instead of absolute counts. Select Percentage in the Statistics field of the Element Properties window beside the Chart Builder window and then press Set Parameters... in order to select Total for Each X-Axis Category. Do not forget to confirm your selections by pressing Continue and Apply, respectively. After closing the Chart Builder with OK, the bars of the four occupational types now will be equalized to the same height, and then the proportion of males and females can be compared more easily between occupational types. The scaling to 100% within each category can also be done ex post in the Options menu of the Chart Editor. 29 14B2.3. Describing the distribution of nominal variables The order of appearance can be changed (e. g., to bring female on top) by double-clicking on any bar within the Chart Editor and selecting the Categories tab: The sorting order can also be changed for the type of occupation by double-clicking on any label on the X-axis within the Chart Editor and selecting the Categories tab again. 30 14B2.3. Describing the distribution of nominal variables To create a table containing frequencies and percentages of type of occupation broken down by sex, choose Analyze-Tables-Custom Tables and, in addition to the selections shown above, drag the variable Sex to the Columns field, and request Row N % in the Summary Statistics dialogue. Sex male Type of occupation 10 Col % 25.0% Row % 41.7% mostly standing 22 55.0% mostly in motion 7 17.5% retired 1 2.5% mostly sitting Count female Count 14 Col % 32.6% Row % 58.3% 59.5% 15 34.9% 40.5% 43.8% 9 20.9% 56.3% 16.7% 5 11.6% 83.3% The Row% sum up to 100% within a row, the Col% values sum up within each column. Therefore, we can compare the proportion of males between types of occupation by looking at Row%, and we compare the proportion of each type of occupation between the two sexes by looking at the Col%. 31 15B2.4. Describing the distribution of ordinal variables 2.4. Describing the distribution of ordinal variables All methods for nominal variables also apply to ordinal variables. Additionally, one may use the so-called non-parametric or distribution-free statistics which make specific use of the ordinal information contained in the variable. In our data set, the only ordinal variable is “sports”, characterizing the intensity of leisure sports of the probands. Although the calculation of nonparametric statistics is possible here, with such crude classification the methods for nominal variables do better. However, some attention must be paid to the correct order of categories – with ordinal variables, categories may not be interchanged. The frequency table and bar chart for sports, grouped by sex, look as follows: Sex male Sports 7 Col % 17.5% Row % 50.0% seldom 22 55.0% sometimes 11 27.5% never Count female often 32 Count 7 Col % 16.3% Row % 50.0% 52.4% 20 46.5% 47.6% 45.8% 13 30.2% 54.2% 3 7.0% 100.0% 16B2.5. Describing the distribution of scale variables 2.5. Describing the distribution of scale variables Histogram The so-called histogram serves as a graphical tool showing the distribution of a scale variable. It is mostly used in the explorative step of data analysis. The histogram depicts the frequencies of a categorized scale variable, similarly to a bar chart. However, in a histogram there is no space between the bars (because consecutive categories border on each other), and it is not allowed to interchange bars. In the following example, 143 students of the University of Connecticut were arranged according to their height (source: Schilling et al, 2002). This resulted in a “living histogram”: Scale variables must be categorized before they can be characterized using frequencies (e. g., age: 0-40=young, >40=old). Affording a histogram with SPSS involves an automatic categorization which is done by SPSS before computing the category counts. The category borders can later be edited “by hand”. The histogram is created by choosing the first variant (Simple Histogram) from the histograms offered in the Gallery tab of the Chart Builder. To create a histogram for abdominal girth, drag the variable Abdominal girth (cm) [waist] into the field X-Axis ? and press OK. 33 16B2.5. Describing the distribution of scale variables 34 16B2.5. Describing the distribution of scale variables Suppose we want abdominal girth to be categorized into categories of 10 cm each. Double-click the graph and double-click any bar of the histogram. In the Properties window which pops up select the Binning tab and check Custom and Interval width in the X-Axis field. Entering the value 10 and confirming your choice by clicking Apply updates the histogram to the new settings: 35 16B2.5. Describing the distribution of scale variables From the shape of the histogram we learn that the distribution is not symmetric, the tail on the right hand side is longer than the tail on the left hand side. Distributions of that shape are called “right-skewed”. They often originate from a natural lower limit in the variable. The number of intervals to be displayed is a matter of taste and depends on the total sample size. The number should be specified such that the frequency of each interval is not too small such that the histogram contains artificial “wholes”. 36 16B2.5. Describing the distribution of scale variables Histograms can also be compared between subgroups. For this purpose perform the same steps within the Chart Builder as in the case of pie charts: The histograms use the same scaling on both axes, such that they can be compared easily. Note that a comparison of the abdominal girth distributions between both sexes requires the use of relative frequencies (by selecting Histogram Percent in the Element Properties window beside the Chart Builder). 37 16B2.5. Describing the distribution of scale variables Dot plot The dot plot serves to compare the individual values of a variable between groups, e. g., the abdominal girth between males and females. In the Chart Builder select the first variant (Simple Scatter) from the Scatter/Dot plots in the Gallery tab and drag Abdominal girth and Sex into the vertical and horizontal axis fields, respectively: 38 16B2.5. Describing the distribution of scale variables The dot plot may suffer from ties in the data, i. e., two or more dots superimposed, which may obscure the true distribution. Later a variant of the dot plot is introduced that overcomes that problem by a parallel depiction of superimposed dots. Using proper statistical measures Descriptive statistics should give a concise description of the distribution of variables. There are measures describing the position and measures describing the spread of a distribution. We distinguish between parametric and nonparametric (distribution-free) statistics. Parametric statistics assume that the data follow a particular theoretical distribution, usually the normal distribution. If this assumption holds, parametric measures allow a more concise description of the distribution than nonparametric measures (because less numbers are needed). If the assumption does not hold, then nonparametric measures must be used in order to avoid confusing the target audience. 39 16B2.5. Describing the distribution of scale variables Nonparametric statistics Nonparametric measures make sense for ordinal or scale variables as they are based on the sorted sample. They are roughly defined as follows (for a more stringent definition see below): • Median: midpoint of the observations when they are ordered from the smallest to the highest value (50% fall below and 50% fall above this point) • 25th Percentile (first or lower quartile): the value such that 25 percent of the observations fall below or at that value • 75th Percentile (third or upper quartile): the value such that 75 percent of the observations fall below or at that value • Interquartile range (IQR): the difference between 75th percentile and 25th percentile • Minimum: the smallest value • Maximum: the highest value • Range: difference between maximum and minimum While Median, percentiles, minimum and maximum characterize the position of a distribution, interquartile range and range describe the spread of a distribution, i. e., the variation of a feature. IQR and range can be used for scale but not for ordinal variables as differences between two values usually make no sense for the latter. The exact definition of the median (and analogously of the quartiles) is as follows: the median is that value smaller than or equal to which are at least 50% of the observations and greater than or equal to which are at least 50% of the observations. The box plot depicts all of those nonparametric statistics in one graph. Surely, it is one of the most important graphical tools in statistics. It allows to compare groups at a glance without much reducing the information contained in the data and without assuming any theoretical shape of the distribution. Box plots help us to decide if a distribution is symmetric, right-skewed or left-skewed and if there are very large or very small values (outliers). The only drawback of the box plot compared to the histogram is that it is unable to depict multi-modal distributions (histograms with two or more peaks). The box contains the central 50% of the data. The line in the box marks the median. Whiskers extend from the box up to the smallest/highest observations that lies within 1.5 IQR from the quartiles. Observations that lie farer away from the quartiles are marked by a circle (1.5 – 3 IQR) or an asterisk (more than 3 IQR from the quartiles). 40 16B2.5. Describing the distribution of scale variables The use of the box plot for ordinal variables is limited as the outlier definition refers to IQR. Box plots are obtained in the Chart Builder by dragging the first variant of the Boxplots offered in the Gallery tab to the preview field. 41 16B2.5. Describing the distribution of scale variables Check Point ID Label in the Groups/Point ID tab and drag the Pat.No.[id] variable to the Point Label Variable? field which appears in the preview field. Thus potentially shown extreme values or outliers are labelled by that variable instead of by the row number in the data view. As before, checking Columns Panel Variable in the same tab would allow to specify another variable that is used to generate multiple charts following the values of that variable. The box plot does well in depicting the distribution actually observed, as shown by the comparison with the original values (dot plot): 42 16B2.5. Describing the distribution of scale variables Parametric statistics and the normal distribution When creating a histogram of body height we can check the Display normal curve option in the Element Properties window beside the Chart Builder window: The bars of the histogram more or less follow the normal curve. This curve follows a mathematical formula, which is characterized by two parameters, the mean µ and the standard deviation σ. Knowing these two 43 16B2.5. Describing the distribution of scale variables parameters, we can specify areas where, e. g., the smallest 5% of the data are located or where the middle 2/3 can be expected. These proportions are obtained by the formula of the distribution function of the normal distribution: x − µ 2 1 = exp − P ( y ≤ x) ∫ 2 dx −∞ 2π σ x P ( y ≤ x ) denotes the probability that a value y which is drawn randomly from a normal distribution with mean µ and standard deviation σ, is equal to or less than x. The usually unknown parameters µ and σ are estimated by their sample values µ̂ (the sample mean) and σˆ (the sample standard deviation, SD): Mean = µ̂ = 1 N N ∑ xi SD = σˆ = i =1 1 N ∑ (xi − µˆ ) N − 1 i =1 2 Assuming a normal distribution and using the above formula we obtain the following data areas: • Mean: 50% of the observations fall below the mean and 50% fall above the mean • Mean-SD, Mean+SD: 68% (roughly two thirds) of the observations fall into that area • Mean-2SD, Mean+2SD: 95% of the observations fall into that area • Mean-3SD, Mean+3SD: 99.7% of the observations fall into that area Note: Although mean and standard deviation can be computed from any variable, the above interpretation is only valid for normally distributed variables. Displaying mean and SD: how not to do it Often mean and SD are depicted in bar charts as the one shown below. However, experienced researchers discourage from using such charts. Their reasons follow from the comparison with the dot plot: 44 16B2.5. Describing the distribution of scale variables Some statistics programs offer bar charts with whiskers showing the standard deviation to depict mean and standard deviation of a variable. The bar pretends that the data area begins at the origin of the bar, which is 150 in our example. When comparing with the dot plot, we see that the minimum for males is much higher, about 162 cm. Furthermore, the mean, which is a single value, is represented by a bar, which covers a range of values. On the other hand, the standard deviation, which is a measure showing the spread of a distribution, is only plotted above the mean. This pretends that the variable spreads only into one direction from the mean. The standard deviation should always be plotted into both directions from the mean. Furthermore, the origin of 150 depicted here is not a natural one, therefore, choice of the length of the bars is completely arbitrary. The relationships of the bars give a wrong idea of the relationships of the means, which should always be seen relative to the spread of the distributions. By changing the y-axis scale, the impression of a difference can be enforced or attenuated easily, as can be seen by the following comparison: 45 16B2.5. Describing the distribution of scale variables Displaying mean and SD: how to do it Mean and SD are correctly depicted in an error bar chart. Select the third variant in the second row (Simple Error Bar) of the Bar Charts offered in the Gallery tab: Put the variable of interest (height in our example) into the vertical axis field, and the grouping variable (sex) into the horizontal axis field. 46 16B2.5. Describing the distribution of scale variables In the field Error Bars Represent of the Elements Properties window check Standard Deviation. The multiplier should be set to 1.0 to have one SD plotted on top of and below the mean. If the multiplier is set to 2.0, then the error bars will cover an area of mean plus/minus two standard deviations. Assuming normal distributions, about 95% of the observations fall into that area. Usually, one standard deviation is shown in an error bar plot, corresponding to 2/3 of the observations. We obtain a chart showing two error bars, corresponding to males and females: 47 16B2.5. Describing the distribution of scale variables The menu also offers to use the standard error of the mean (SEM) instead of the SD. This statistic measures the accuracy of the sample mean with respect to estimating the population mean. It is defined as the standard deviation divided by the square root of N. Therefore, the precision of the estimate of the population mean is proportional to the number of observations. However, the SEM does not show the spread of the data. Therefore, it should not to be used to describe samples. Verifying the normal assumption In small samples it can be very difficult to verify the assumption of normal distribution in a variable. The following illustration shows the variety in observed distribution of normally and right skewed data sets. Values were sampled from a normal distribution and from a right-skewed distribution and then collected into data sets of size 10, 30 and 100. As can be seen, the histograms of the samples of size N=10 often look asymmetric and do not reflect a normal distribution. By contrast, some of the histograms of the right-skewed variable are close to a normal distribution. This effect is caused by random variation, which is considerably high in small samples. Therefore it is very difficult to obtain a unique decision about the normal assumption in small samples. 48 16B2.5. Describing the distribution of scale variables Histograms of normally distributed data of sample size 10: Count 8 Count Count 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 2 6 4 2 8 6 4 2 8 Count 1 4 8 6 4 2 8 Count 0 6 6 4 2 50 75 100 125 50 75 100 125 Normal 50 75 100 125 Normal Normal 50 75 100 125 Normal 50 75 100 125 Normal 50 75 100 125 Normal Histograms of normally distributed data of sample size 30: 0 1 2 3 4 5 6 7 8 9 Count 12 8 4 0 Count 12 8 4 0 Count 12 8 4 0 50 75 100 Normal 125 50 75 100 Normal 125 Histograms of normally distributed data of sample size 100: 0 1 25 Count 20 15 10 5 2 25 Count 20 15 10 5 50 75 100 125 Normal 49 16B2.5. Describing the distribution of scale variables Histograms of right-skewed data of sample size 10: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Count 10 8 6 4 2 Count 10 8 6 4 2 Count 10 8 6 4 2 Count 10 8 6 4 2 Count 10 8 6 4 2 2.5 5.0 7.5 12.5 10.0 rightskew ed 2.5 5.0 7.5 12.5 10.0 rightskew ed 2.5 5.0 7.5 12.5 10.0 rightskew ed 2.5 5.0 7.5 12.5 10.0 rightskew ed 2.5 5.0 7.5 12.5 10.0 rightskew ed 2.5 5.0 7.5 12.5 10.0 rightskew ed Histograms of right-skewed data of sample size 30: 25 0 1 2 3 4 5 6 7 8 9 Count 20 15 10 5 25 Count 20 15 10 5 25 Count 20 15 10 5 2.5 5.0 2.5 7.5 10.0 12.5 5.0 7.5 10.0 12.5 rightskew ed rightskew ed Histograms of right-skewed data of sample size 100: 0 75 1 Count 50 25 0 2 75 Count 50 25 0 2.5 5.0 7.5 10.0 12.5 rightskewed 50 16B2.5. Describing the distribution of scale variables A comparison of an error bar plot with the bars extending up to two SDs from the mean and a dot plot may help in deciding whether the normal assumption can be adopted. Both charts must use the same scaling on the vertical axis. If the error bar plot reflects the range of data as shown by the dot plot, then it is meaningful to describe the distribution using the mean and the SD. If the error bar plot gives the impression that the distribution is shifted up or down compared to the dot plot, then the data distribution should be described using nonparametric statistics. As an example, let us first consider the variable “height”. A comparison of the error bar plot and the dot plot shows good agreement: 2 00 .0 0 A A 1 90 .0 0 1 80 .0 0 ] 1 70 .0 0 ] Height (cm) Height (cm) 1 90 .0 0 1 80 .0 0 1 70 .0 0 n= 40 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A 1 60 .0 0 1 60 .0 0 A A A n= 43 m al e fe ma l e m al e Sex fe ma l e Sex The situation is different for the variable “Abdominal girth” (waist)”. Both error bars seem to be shifted downwards. This results from the rightskewed distribution of abdominal girth. Therefore, the nonparametric statistics should be used for description. 51 16B2.5. Describing the distribution of scale variables A 1 00 .0 0 ] ] 7 5.0 0 Abdominal girth (cm) Abdominal girth (cm) A 1 25 .0 0 1 25 .0 0 1 00 .0 0 7 5.0 0 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A n= 40 5 0.0 0 5 0.0 0 n= 43 m al e fe ma l e m al e Sex fe ma l e Sex In practice, box plots are used more frequently than error bar plots to show distributions in subgroups. Using SPSS one may also symbolize the mean in a box plot (see later). However, error bar plots are useful when illustrating repeated measurements of a variable, as they need less space than box plots. Computing the statistics Statistical measures describing the position or spread of a distribution can be computed using the menu Analyze-Tables-Custom Tables.... To compute measures for the variable “height”, in separate rows according to the groups defined by sex, choose that menu and first drag the variable sex to the Rows area in the preview field. Then drag the height variable to the produced two-row table such that the appearing red frame covers the right-hand part of the male and the female cells: 52 16B2.5. Describing the distribution of scale variables We have seen from the considerations above that height can be assumed to be normally distributed. Therefore, the most concise description uses mean and standard deviation. Additionally, the frequency of each group should be included in a summary table. Press Summary Statistics... and select Count, Mean and Std. Deviation: After pressing Apply to Selection and OK, the output table reads as follow: 53 16B2.5. Describing the distribution of scale variables Sex male Height (cm) female Height (cm) Count 40 Mean 179.38 Std Deviation 6.83 43 167.16 6.50 The variable “Abdominal girth” is best described using nonparametric measures. These statistics have to be selected in the Summary Statistics... submenu: Sex male Abdominal girth (cm) female Abdominal girth (cm) Count 40 Median 91.50 Percentile 25 86.00 Percentile 75 101.75 Range 51.00 43 86.00 75.00 103.00 86.00 54 16B2.5. Describing the distribution of scale variables Transferring output to other programs After explorative data analysis, some of the results will be used for presentation of study results. Parts of the output collected in the SPSS Viewer can be copied and pasted in other programs, e. g. MS Powerpoint or MS Word 2007. • Charts can simply be selected and copied (Strg-C, Ctrl-C). In the target program (Word or Powerpoint 2007) choose Edit-Paste (Strg-V, Ctrl-V) to paste the as an editable graph. • Tables are transferred the following way: select the table in the SPSS viewer, and choose Edit-Copy (Strg-C, Ctrl-C) to copy it, and in the target program choose Edit-Paste (Strg-V, Ctrl-V) to paste it as an editable table. 55 16B2.5. Describing the distribution of scale variables Summary To decide which measure and chart to use for describing the distribution of a scale variable, first verify the distribution using histograms (by subgroups) or box plots. Check histogram or box plot: Normally distributed? m ale Not normally distributed? fe ma le m ale fe ma le 10.0 7.5 Count Count 15 10 5.0 5 2.5 0 160.00 170.00 180.00 190.00 160.00 Height (cm ) 170.00 180.00 0.0 60.00 190.00 Height (cm ) 80.00 100.00 120.00 140.00 60.00 80.00 Abdom inal girth (cm) A A 120.00 140.00 A 1 40 .0 0 Abdominal girth (cm) 1 90 .0 0 A Height (cm) 100.00 Abdom inal girth (cm) A 1 80 .0 0 1 70 .0 0 A 1 60 .0 0 1 20 .0 0 1 00 .0 0 8 0.0 0 A A m al e fe ma l e 6 0.0 0 Sex m al e fe ma l e Sex Measure of center: Mean Measure of spread: Standard Deviation Chart: Median Additionally, minimum and maximum may be used Error bar plot Large N? 1st, 3rd quartiles Box plot (including means) Small N? Minimum, Maximum Box plot A A A 1 40 .0 0 170.00 ] Height (cm) Height (cm) 180.00 n= 40 1 80 .0 0 A A a 1 70 .0 0 a ] A Abdominal girth (cm) 1 90 .0 0 1 20 .0 0 1 00 .0 0 8 0.0 0 1 60 .0 0 A A 160.00 n= 43 male f emale Sex m al e fe ma l e Sex 6 0.0 0 m al e fe ma l e Sex The section “Further graphs” shows how to include means into box plots. 56 17B2.6. Outliers 2.6. Outliers Some extreme values in a sample may affect results exorbitantly. Such values are called outliers. The decision on how to proceed with outliers depends on their origin: • • The outliers were produced by measurement errors or data entry errors, or may be the result of untypical circumstances at time of measurement. If the correct values are not available (e. g., because of missing case report forms, or because the measurement cannot be repeated), then these values should be excluded from further analysis. If an error or an untypical concomitant can be ruled out, then the outlying observation must be regarded as part of the distribution and should not be excluded from further analysis. An example, taken from the data file “Waist-Hip.sav”: There appears to be an observation with height less than 100 cm. This points towards a data entry error. Using box plots, such implausible values are easily detected. The outlier is identified by its Pat.No., that has been set as Point ID variable. Alternatively, we could identify the outlier by 57 17B2.6. Outliers sorting the data set by height (Data-Sort Cases). Then the outlying observation will appear in the first row of data. To exclude the observation from further analyses, choose Data-Select Cases... and check the option If condition is satisfied. Press If... and specify height > 100. Press Continue and OK. Now the untypical observation is filtered out before analysis. A new variable “filter_$” is created in the data set. If the box plot request is repeated, we still obtain the same data area as before. 58 17B2.6. Outliers 59 17B2.6. Outliers The scaling of the vertical axis must be corrected by hand, by doubleclicking the chart and subsequently double-clicking the vertical axis: 60 17B2.6. Outliers In the Scale tab uncheck the Auto option at Minimum and insert a value of 150, say. Finally, after rescaling the vertical axis we obtain a correct picture: 61 17B2.6. Outliers 62 18B2.7. Missing values 2.7. Missing values Missing values (sometimes called “not-available values” or NA’s) are a problem common to all kind of medical data. In the worst case they can lead to biased results. There are several reasons for the occurrence of missing data: • • • • • Breakdown of measurement instruments Retrospective data collection may lead to values that are no longer available Imprecise or fragmentary collection of data (e. g., questionnaires with some questions the respondent refused to reply) Missing values by design: some values may only be available for a subset of patients Drop-out of patients from studies o Patients refusing further participation o Patients are lost to follow-up o Patients are dead Reports on studies following the patients over a long time should always include information about frequency and time of drop-outs. This is best accomplished by a flow chart, which is e.g. mandatory for any study published in the British Medical Journal. Such a flow-chart may resemble the following, taken from Kendrick et al (2001) (http://bmj.bmjjournals.com/cgi/content/abstract/322/7283/400, a comparison of referral to radiography of the lumbar spine and conventional treatment for patients with low back pain): 63 18B2.7. Missing values 64 18B2.7. Missing values Furthermore, values may be partly available (censored). This is quite typical for a lot of medical research questions. Censored means that we don’t know the exact value of the variable. All we know is an interval where the exact value is contained: • Right censoring: Quite common when studying survival times of patients. Assume e.g. that we are interested in the survival of patients after colon cancer surgery. If a patient is still living at the time of statistical analysis, then all we know is a minimum survival time of, say, 7.5 years for this patient. His true but unkown survival time is within 7.5 years and infinity. This upper limit of infinity is mathematical convenience, in our example an upper limit of, say, 150 years would suffice to cover all possible values for human survival times. • Left censoring: Here a maximum for the unknown value is known. This happens quite often in laboratory settings, where a smaller value than a certain level of dection (LOD) cannot be observed. So all we know is that the true value falls within the interval between 0 and LOD. • Interval censoring: Both a lower and upper limit for the unknown value is known. For instance, we are interested in the age when a person becomes HIV-positive. The person is medically examined at an age of 22.1 years and 27.8 years. If the HIV-test is negative at the first examination but positive at the second one, then we will know that the HIV-infection happened between an age of 22.1 and 27.8 years, but we won’t know the exact age when the infection occurred. According to their impact on results, one roughly distinguishes between • Values missing at random: the reason for missing values may be known, but is independent from the variable of interest • Nonignorable missing values (also called missing not at random): the reason for missing values depends on the magnitude of the (unobserved) true value; e. g., if patients in good condition refuse to show up at follow-up examinations 65 18B2.7. Missing values Strategies for data analysis in the presence of missing values include: 1. Complete case analysis: patients with missing values are excluded from analysis 2. Reconstructing missing values (imputation): o In longitudinal studies with missing measurements, often the last available value is carried forward (LVCF) or a missing value is replaced by interpolating the neighboring observations o Missing values are replaced by values from similar individuals (nearest neighbor imputation) o Missing values are replaced by a subgroup mean (mean imputation) 3. Application of specialized techniques to account for missing or censored data (e. g., survival analysis) Any methods attempting to reconstruct missing values (second strategy, see above) need to assume that the values are missing at random. This assumption cannot be verified in a direct way (like the normality assumption). Therefore, the first option (complete case analysis) is often preferred unless special techniques are available (e.g., for survival data in the presence of right-censoring). The better alternative is an improved study design that tries to avoid missing values. Especially in cases where nonignorable missing values are plausible a careful sensitivity analysis should be performed. 66 18B2.7. Missing values Example 2.7.1: Quality of life. Consider a variable measuring quality of life (a score consisting of 10 items) one year after an operation. Suppose that all questionnaires have been filled in completely. Then the histogram of scores looks like the following: 40 Frequency 30 20 10 Mean =50.8947 Std. Dev. =20.01785 N =114 0 0.00 20.00 40.00 60.00 80.00 100.00 120.00 Quality of life 20 15 Frequency Now suppose that some patients could not answer some items, such that the score is missing for those patients (assuming missing-atrandom). The histogram of quality-oflife scores doesn’t change much: 10 5 Mean =49.7727 Std. Dev. =18.91584 N =66 0 0.00 20.00 40.00 60.00 80.00 100.00 120.00 Quality of life 40 30 Frequency Now consider a situation of nonignorable missing values: suppose that patients suffering from heavy side effects are not able to fill in the questionnaire. If the histogram is based on data of the remaining patients, we obtain: 20 10 Mean =61.8429 Std. Dev. =15.88421 N =70 0 0.00 20.00 40.00 60.00 80.00 100.00 120.00 Quality of life The third histogram shows a clear shift towards the right, and the mean is about 10 units higher than with the other two histograms. This reflects the bias occurring if the missing-at-random precondition is not fulfilled. 67 19B2.8. Further graphs 2.8. Further graphs Dot plots with heavily tied data With larger sample sizes and scale variables of moderate precision, data values are often tied, i. e., multiple observations assume the same value. A simple dot plot as shown above superimposes those observations such that the density of the distribution at such values is obscured. To overcome that problem, a dot plot that puts tied values in parallel can be created. First create a Simple Scatter plot as shown above for Abdominal girth by Sex in the “cholesterol.sav” data set. Then check the Stack identical values in the Element Properties window: 68 19B2.8. Further graphs After rescaling the vertical axis to a minimum of 50 and a maximum of 150 the resulting chart looks as follows: 69 20B2.9. Exercises 2.9. Exercises The following exercises are useful to practice the concepts of descriptive statistics. All data files can be found in the filing tray “Chapter 2”. For each exercise, create a chart and a table of summarizing statistics. 2.9.1. Coronary arteries Open the data set “coronary.sav”. Compare the achieved treadmill time between healthy and diseased individuals! Use parametric and nonparametric statistics and compare the results. 2.9.2. Cornea (cont.) Open the data set “cornea.sav”. Compare the distribution of cornea temperature measurements between the age groups by creating a chart and a table of summary statistics. 2.9.3. Hemoglobin (cont.) Open the data set “hemoglobin.sav”. Compare the level of hemoglobin between pre-menopausal and post-menopausal patients using adequate charts and statistics. Repeat the analysis for hematocrit (PCV). 2.9.4. Body-mass-index and waist-hip-ratio (cont.) Open the data set “bmi.sav”. Compare body-mass-indices of the patients in the four categories defined by sex and disease. Use adequate charts and statistics. Repeat the analysis for waist-hipratio. Pay attention to outliers! 2.9.5. Cardiac fibrillation (cont.) Open the data set “fibrillation.sav”. Compare body-mass-index between patients that were successfully treated and patients that were not (variable “success”). Repeat the analysis building subgroups by treatment. Repeat the analysis comparing potassium level and magnesium level between these subgroups. 2.9.6. Psychosis and type of constitution (cont.) Open the data file “psychosis.sav”. Find an adequate way to graphically depict the distribution of constitutional type, grouped by type of psychosis. Define “frequency” as the frequency variable! Create a table of suitable statistics. 2.9.7. Down’s syndrome (cont.) Open the data file “down.sav”. Depict the distribution of mother’s age. Is it possible to compute the median? Depict the proportion of children suffering from Down’s syndrome in subgroups defined by mother’s age. Is there a difference when compared to absolute counts? 2.9.8. Flow-chart Use the data file “patientflow.sav” and fill in the following chart showing the flow of patients through a clinical trial. 70 20B2.9. Exercises Registered patients N= Refused to participate N= Randomized N= Active group N= Placebo group N= Lost to follow up N= Refused to continue N= Died N= Lost to follow up N= Refused to continue N= Died N= 3 weeks Treated per protocol N= Treated per protocol N= Lost to follow up N= Refused to continue N= Died N= Lost to follow up N= Refused to continue N= Died N= 6 weeks Treatment completed per protocol N= 2.9.9. Treatment completed per protocol N= Box plot including means Use the data set “waist-hip.sav”. Generate box plots which include the means for the variables body-mass-index and waisthip-ratio, grouped by sex. 2.9.10. Dot plot Use the data set “waist-hip.sav”. Generate dot plots, grouped by sex, for the variables body-mass-index and waist-hip-ratio as described in section 2.5. Afterwards, generate dot plots which put tied values in parallel as described in section 2.8. 71 21B3.1. Introduction Chapter 3 Probability 3.1. Introduction In the first two chapters various methods to present medical data have been introduced. These methods comprise graphical tools (box plot, histogram, scatter plot, bar chart, etc.) and statistical measures (mean, standard deviation, median, quartiles, quantiles, percentages, etc.). These statistical tools have one particular purpose in the field of medicine: to ease the communication with colleagues. In direct conversation, at talks or in publications, the essential of empirical data should be described comprehensibly and correctly. Example 3.1.1. Of 34 Patients, 18 were treated by ABC-NEW and 16 by XYZ. Treatment with ABC-NEW led to 9 successes (50%). With XYZ, only 4 success (25%) could be observed. In example 3.1.1, the essential of the empirically collected data is described comprehensibly and correctly. This information is only interesting for others than those involved in the trial (and the patients of course!), if the results can be generalized to other patients suffering from the same disease. Then it could be used to predict the treatment success in such patients. Under which conditions can empirical results based on observations be generalized? Put another way, when can we draw conclusions from a part to the whole? Here we have arrived at an important but difficult point. Therefore, some more general thoughts have to be considered at this place. Population – Sample Assume a bin (e. g., an urn or a bag) containing many red and white balls. We are interested in the proportion of red and white balls. Many more practical problems are similar to this question: • Delivery of 5000 blood bottles in the production of blood plasma: some of them are infected by Hepatitis-B (red balls), some not (white balls). • Delivery of 1 million goulash tin cans for the military service: some of them are decomposed (red balls), some not. 72 21B3.1. Introduction • Various patients treated by XYZ: some can be healed (red balls), some not (white balls). • Tossing a coin: each ball corresponds to a toss of a coin, red balls correspond to “head”, white balls to “tail”. Please note that here we are dealing with a very large (infinitely large) number of coin tosses. The urn model with balls of two colors corresponds to a so-called binary outcome. This model can be extended to nominal outcomes, by introducing more colors. Also for ordinal or scale outcomes we could think of similar models. Although the inspection of goulash cans, the examination of efficacy of clinical treatments or the coin toss are not comparable by their matter, they have a related structure formally, which can be represented by the urn model. How can the proportion of red (white) balls be found: • • • Examine all balls: this will be the only way to think of with blood bottles. Examine only a part: for goulash tins, treatment XYZ, and coin toss Examine none (use knowledge from other sources): for blood bottles, goulash cans, treatment XYZ, coin toss From now on, we will deal with the second option, the examination of a part (a sample), to draw conclusions that are valid for the whole (the population). We proceed the following way: (1) Draw a sample from the population (2) Determine the required features (3) Draw conclusions from the results of the sample to the properties of the population Ad (1). Drawing a sample corresponds to drawing balls out of the urn, selecting goulash tins for quality control, recruiting patients for a clinical trial or to coin tosses. The crucial point with a sample is that it is representative for the population, which can be achieved by a random selection. Please note: only results from representative samples can be generalized to the population! Results that cannot be generalized are mostly uninteresting. Ad (2). The determination of the required features is often difficult. It will not be much of a problem to determine the color of a ball, but can be more tricky in the goulash or treatment success examples, particularly 73 21B3.1. Introduction with borderline cases. Therefore, a good study protocol should contain exact and comprehensible definitions that are free from contradiction. Ad (3). How can we draw conclusions from a sample to properties of a population? This difficult and very fundamental problem will be dealt with in the following chapters. Note: We will even extend the problem by considering two representative samples, which should be used to answer the question whether the underlying populations are the same or not. This corresponds to a situation of two urns from which a particular number of balls are drawn in order to decide whether the proportion of red balls is the same in both urns or not. In clinical context this corresponds to the comparison of two treatments: by using treatments ABC-NEW, are we able to heal more patients than by using XYZ? First of all, we will flip sides. Assume that we know the characteristics of a population. What happens if we draw a random sample from it? This will be the content of the next chapter, “Probability theory”. 74 22B3.2. Probability theory 3.2. Probability theory An experiment experiment. with unpredictable outcome is called a random Using this definition, we can call the application of a clinical treatment a random experiment, because we do not know in advance the outcome of the therapy (e. g., can the patient be healed? How long will the patient survive? How much can the drug lower the blood pressure?). The set of all possible outcomes of a random experiment (the set of all elementary events) constitutes the sample space. Based on the sample space, we can define random events. • Random experiment: rolling a dice Sample space: {1, 2, 3, 4, 5, 6} Event A … “2”, A = {2} Event B … an even number, B={2, 4, 6} Event C … a number greater than 3, C= {4, 5, 6} • Random experiment: opinion poll Sample space: {Preferences of all Austrians older than 14 years with respect to food and stimulants} Event D … person is smoker Event E … person prefers bread and margarine over bread and butter Note: These two outcomes appear very simple at first glance, but are defined very loosely. When is a person a smoker? Is it somebody smoking one cigarette once a week? Is somebody a non-smoker, if he stopped smoking three days ago? Outcome E is even more unclear: what about a person who dislikes margarine and butter? What about a person who prefers margarine over butter, but dislikes bread? What is the definition of margarine, is it a certain brand or the product in general? • Random experiment: survival of Ewing’s sarcoma patients after chemotherapy Sample space: {all real numbers between 0 and 130 years} Event F … patient dies within the first year Event G … patient survives longer than 15 years 75 22B3.2. Probability theory • Random experiment: sleeping duration within 24 hrs Sample space: {all real numbers between 0 and 24} Event H … less than 8 hrs, H = [0,8) Event I … abnormal sleeping duration, i. e. less than 4 hrs or more than 11 hrs, I = [0,4} (11,24] The complementary event of an event A is the event that A will not happen, i. e., it consists of all elements of the sample space that are not covered by the event A. Complementary event of B: an odd number, BC = {1, 3, 5} Complementary event of E: EC = {person does not prefer bread and margarine over bread and butter} A certain event is an event which is known a priori to be certain to occur, e. g., rolling a dice and obtaining a number less than or equal to 6, S={1, 2, 3, 4, 5, 6} An impossible event is an event which is known a priori to be certain to not occur, e. g., rolling a dice and obtaining a number between 19 and 25, U={}. Union of two events is the event that at least one of the events occurs Union of B and C: the number is even, or greater than 3, or both. B C = {2, 4, 5, 6} The event I (abnormal sleeping duration) is the union of two events K and L: I = K L Event K … too short sleeping duration, K=[0,4) Event L … too long sleeping duration, K=(11,24] The intersection of two events is the event that both events occur. Intersection of B and C, i. e. the number is even and greater than 3: B C = {4, 6} What is the intersection of K and L? It is an impossible event, because it is impossible to sleep too long and too short at the same time (literally!). We say, the events K and L are mutually exclusive or disjoint events. The intersection of DC and IC: a person who does not smoke and sleeps between 4 and 11 hrs. 76 22B3.2. Probability theory After defining events, we can proceed to compute probabilities. Example 3.2.1: the probability of the outcome “2” when rolling a fair dice, is 1/6. Example 3.2.2: the probability of a male birth is 51.4%. In both examples above there is a strong relationship to “relative frequency”. Ad example 3.2.1: from physical considerations we know, that a fair dice has the same probability for each side. Therefore, we assume that when rolling the dice repeatedly, the relative frequency of “2” approaches the mathematically computed number of 1/6. The calculation of this mathematical probability follows the following formula: Number of favorable elementary events g 1 = = m 6 Total number of elementary events where a “favorable elementary event” is an elementary event that is an element of the event of interest. This formula is also called Laplace probability (Pierre Simon Marquis de Laplace, 1749-1827) or the classical definition of probability. This definition is only reasonable if elementary events are assigned equal probabilities. Therefore, we can apply it also to a coin flip. Ad example 3.2.2: this proposition is based on observation. After observing for a long time, that the relative frequency of a male birth is about 51.4%, we assume that this empirical law also applies to future births. Therefore, we call the definition of probability used here the statistical definition of probability; the relative frequency in a very, very long series of experiments. Difference between relative frequency and probability: • • • Relative frequency: property of a sample Probability: property of a population, it refers to a future nonpredictable event Probability is the expectation of relative frequency. Please note: Probabilities are always tied to events. If a probability is stated somewhere, it must be assured, that the related event is defined uniquely and comprehensibly. This is exemplified by the following: • Gigerenzer, 2002: a psychiatrist regularly administered the drug Prozac to depressive patients. He informed the patients, that sexual 77 22B3.2. Probability theory problems will occur with a probability of 30 to 50%. He meant that for 3 to 5 patients out of 10, sexual problems will occur. However, many patients thought that in 30 to 50 per cent of their sexual activities, problems will occur. • Randow, 1992: after the gulf war of 1991, the US Navy reported a 99% success rate of their Cruise Missiles. After inquiries, it turned out that 99% of the rockets could be started without problems. It was not a statement about the hit rate. • Nowadays, better diagnostic procedures are able to detect metastases in lung cancer patients that could not be detected formerly. Therefore, such patients are now classified into the stage “bad” instead of the earlier classification “good”. This phenomenon is often called stage migration. As a consequence, the survival probabilities in both stages increased, although nothing changed with a patient’s individual outcome. The reason for the change in survival probabilities lies in the migration of patients from the stage “good” to the stage “bad”; these patients had poor outcome when compared to others of the “good” stage, but better outcome when compared to “bad” stage. Mathematically, three properties suffice for probability calculus: I. The probability of an event is a number between 0 and 1 (0 and 100 per cent). II. Probability 1 (100%) is assigned to the certain event. III. The probability of an event that is the union of mutually exclusive events is the sum of probabilities corresponding to these events. These properties are also called Komogorov’s axioms, where Kolmogorov put a little more mathematical effort in the formulation of his third axiom. Ad I and II. Following this axiom, probability is just a measure quantifying the possible occurrence of an event. As the certain event always occurs, it is assigned the highest possible probability (1). The range from 0 to 1 is used just for convenience, we could use any other range (e.g., -94.23 to 2490.08), but we would be silly to do so. Ad III. This is a simple and plausible rule. Consider rolling a fair dice: if the probability for the outcome “2” is 1/6, and the probability of rolling “3” is also 1/6, then the probability of rolling “2” or “3” is 1/6+1/6 = 2/6. 78 22B3.2. Probability theory These events are mutually exclusive, because we cannot roll a “2” and a “3” at the same time. Ad I-III. All calculation rules concerning probabilities can be deducted from these three properties. Often the letter “P” is used to denote probabilities. When beginning with the calculation of probabilities, one tends to use phrases like “the probability of a male birth is 0.514 or 51.4%” to describe the results. After some practice this seems cumbersome and one seeks abbreviations like “P(M)=0.514”. In the following, we will mainly use abbreviated notation. The probability of the complementary event Assume that the probability of an event X is known and denoted by P(X). One minus this probability gives us the probability of the complementary event: P(XC) = 1 – P(X) In order to visualize this probability calculation, one can use Venn diagrams. The sample space (certain event) is drawn as a rectangle. Events are represented as ellipses within the sample space. In our case the event X is represented by a gray-colored ellipse. The probability of X is the proportion of the area covered by the ellipse from the rectangular area. 79 22B3.2. Probability theory The area outside the ellipse represents the complementary event: Example: the probability of a male birth is 0.514, i. e. the probability of a female birth is 1 – 0.514 = 0.486 or 48.6%. Probability of the union of two events which are not mutually exclusive The probability of the union of two events which are not mutually exclusive is computed by summing up the probabilities of the events and substracting the probability of the intersection of the two events (otherwise, the probability of the intersection would be counted twice): P(X Y) = P(X) + P(Y) – P(X Y) The probability is again represented by a Venn diagram: 80 22B3.2. Probability theory Example: dice flip: the probability of the event that we obtain an even number or a number greater than 3 P(B C) = P(B) + P(C) – P(B C) = 3 3 2 4 + − = 6 6 6 6 Conditional probabilities In daily life, our assessment of particular circumstances may change as soon as we obtain new information. In probability theory, this observation is reflected by the concept of conditional probability. As an example, consider the probability of a multiple birth, which is in general very small. Our assessment will change as soon as we know that the probability refers to a woman who underwent hormonal treatment. The conditional probability is the probability of an event when you know that the outcome was in some particular part of the sample space. Put another way, it is the probability of an event when you know that another outcome has already occurred. Mathematically, it is written as P(X|Y) … the conditional probability of X given Y By means of a Venn diagram we can easily visualize that P(X|Y) = P(XY)/P(Y), i. e., P(XY) = P(Y) P(X|Y) Example: in diagnostic studies clinical, radiological or laboratory procedures are evaluated if they are suitable to detect particular diseases. We use terms like sensitivity and specificity to quantify this suitability. These are basically conditional probabilities. In general, a diagnostic test for a condition is said to be positive if it states that the condition is present and negative if it states that the condition is absent. Sensitivity is defined as the probability that the test states the condition is present, given it is actually present. We are dealing with two events: Event1={test is positive} Event2={condition is actually present} The sensitivity can thus be written as conditional probability: Sensitivity=P(Event1|Event2) 81 22B3.2. Probability theory Specificity is defined as the probability that the test states the condition is absent, given it is actually absent. Now we are dealing with the complementary events: Event1C={test is negative} Event2C={condition is absent} Specificity=P(Event1C|Event2C) In medical applications special interest is given to the positive predictive value. It is defined as the probability that the condition is present, given the diagnostic test is positive. There is an analogue referring to the complementary events: the negative predictive value. It is defined as the probability that the condition is absent, given the test states that it is absent. Independence of events This concept also applies to daily life. Loosely speaking, independence of events means that the events do not influence each other. This means that occurrence of one event does not change the probability of the other event. As an example consider once more the toss of a dice. The event that a “2” is rolled is independent from the event that in the toss before a number greater than 3 or lower than 3 was obtained. Another example: assume you read in a scientific journal, that eating herbal drops and a subsequent healing success are statistically independent events. This is just a scientifically correct explanation for the fact that a patient suffering from a particular disease will not be healed if he or she takes herbal drops. Events X and Y are independent, if P(X|Y) = P(X) or P(X|Y) = P(X|YC) or P(XY) = P(X)P(Y) … this is also called the multiplication rule. The opposite of independence is dependence. As an example, consider the event “multiple birth”, its probability depends on the occurrence of the event “hormonal treatment”. Note: we distinguish stochastic and causal dependence. While stochastic dependence is symmetrical, causal dependence is a one-way relationship between events. Clearly, multiple birth is causally dependent on hormonal 82 22B3.2. Probability theory treatment, because a reverse relationship doesn’t make sense. Causal dependence means that the occurrence of a cause changes the probability of an effect. This does not mean necessarily, that the effect follows directly from the cause. Many dependencies that can be observed have to be declared as preliminarily stochastic, just because the lacking substance matter knowledge does not permit a causal explanation. The stochastic dependency may turn into a causal dependency if we later obtain additional knowledge. When computing conditional probabilities you don’t have to take care about causal dependencies. We are allowed to calculate both P(effect|cause) and P(cause|effect). The latter, i. e. inferring the probability of a cause given a particular effect has been observed, is typical for private investigators, historians and forensic doctors. Example 3.2.3: Consider the game “ludo” (in german: “Mensch-ärgere-Dichnicht”). A player must throw a 6 to move a piece from the starting circle onto the first square on the track. What is the probability not to throw a “6” in three consecutive throws? We start with the definition of three events: E1 … no “6” at the first throw E2 … no “6” at the second throw E3 … no “6” at the third throw 83 22B3.2. Probability theory We know that the throws are independent. Therefore, we can make use of the multiplication rule: P(no “6” in three consecutive throws) = P(E1E2E3)=P(E1)P(E2)P(E3) What we need now is the probability to throw no “6” at one throw. Clearly, it is 5/6. Thus, P(no “6” in three consecutive throws) = 5/6 x 5/6 x 5/6 = 125/216 = 0.5787. Example 3.2.4: Patients may suffer from infectious diseases without showing symptoms (asymptomatic infections). Assume that for a particular disease, the probability of an asymptomatic infection is 40%. Consider two patients that are infected. What is the probability that (a) both infections are asymptomatic (b) both infections are symptomatic (c) exactly one infection is symptomatic? a) P(asymptomatic and asymptomatic) b) P(symptomatic and symptomatic) c) P(symptomatic and asymptomatic) = 0.4 * 0.4 = = 0.6 * 0.6 = = 0.4 * 0.6 + 0.6 * 0.4= 0.16 0.36 0.48 1.00 Ad c) The event “exactly one infection is symptomatic” consists of two mutually exclusive events. These are “first infection is asymptomatic and second infection is symptomatic” and “first infection is symptomatic and second infection is asymptomatic”. Example 3.2.5: Everyone has two copies of each gene (one passed over from the mother and one from the father). Each gene may appear in different alleles. As an example, consider the ABO gene which determines blood groups. There are three main alleles (A, B and O). The relative population frequencies of different alleles of a gene are called allele frequencies (gene frequencies), where each individuals contributes two alleles to the population of alleles. In a Caucasian population the allele frequencies of the blood group alleles are: P(A)=0.28, P(B)=0.06, P(O)=0.66 An individual's genotype for a gene is the pair of alleles it happens to possess. With the blood group gene, we know of six genotypes, these are AA, AB, AO, BB, BO, OO, where BA, OA and OB are the same as AB, AO and BO, respectively. The phenotypes of the blood groups (A, B, AB, O) are constituted by the different genotypes, where A and B are dominant over O, and A and B are mutually codominant. Thus, phenotype A is constituted by the genotypes AA 84 22B3.2. Probability theory and AO, phenotype B by genotypes BB and BO, phenotype AB by genotype AB, and phenotype O by the genotype OO. Each parent transmits each of his/her pair of alleles with probability ½. The transmitted alleles constitute the genotype (and the phenotype) of the offspring. In a large population • • • • with random mating (each individual has the same probability to mate with everybody), without migration without mutation and without selection (influence of the genotype on fertility of the individual and on the viability of his/her offspring) the allele frequencies and the genotype frequencies are constant over generations. Such a population is said to be in Hardy-Weinberg-equilibrium, and the genotype frequencies are easily computed from the allele frequencies (guess why?). Assume a Caucasian population in Hardy-Weinberg-equilibrium. What are the genotype frequencies of the blood group gene in this population? P(genotype AA) P(genotype AB) P(genotype AO) P(genotype(BB) P(genotype BO) P(genotype OO) = = = = = = P(allele A) * P(allele A) 2 * P(allele A) * P(allele B) 2 * P(allele A) * P(allele O) P(allele B) * P(allele B) 2 * P(allele B) * P(allele O) P(allele O) * P(allele O) = = = = = = 0.0784 0.0336 0.3696 0.0036 0.0792 0.4356 1.0000 Drawing with or without replacement What is the probability for “6 out of 6” in a lottery with 45 numbers? Hint: consider the funnel containing 45 balls, with 6 balls colored red (these refer to the numbers you have bet on), the other 39 balls are white. What is the probability to draw six red balls in six consecutive draws? It is crucial that the balls are not returned to the funnel after being drawn (such that the same number cannot appear twice among the winning numbers). At the first draw, the probability of drawing a red ball, P(R1) = 6/45 = 0.1333. At the second draw, the probability of drawing a red ball depends whether a red or a white ball was drawn at the first draw: P(R2|R1) = 5/44 = 85 22B3.2. Probability theory 0.1136, and P(R2|R1C) = 6/44 = 0.1364. This means, that the consecutive draws are no longer independent. The same argument applies for the probability of the third, fourth, firth and sixth draw. The probability for a “6 out of 6” is therefore not so easy to compute. At last, it is given by P(6 out of 6) = P(R1) * P(R2|R1) * P(R3|R1R2) * P(R4|R1R2R3) P(R5|R1R2R3R4) * P(R6|R1R2R3R4R5) = 6/45 * 5/44 * 4/43 * 3/42 * 2/41 * 1/40 = 0.000000123. * Note: when drawing without replacement from a large population, then the effect on the probabilities for the next draw are usually very low, such that it can be ignored in computations. As an example, consider 5000 balls among which there are 1000 red balls. At the first draw the probability to draw a red ball is P(R) = 1000/5000 = 0.2. At the second draw, the probability to draw a red ball (without replacing the first ball) is now either P(R|R) = 999/4999 = 0.199984 or P(R|W) = 1000/4999 = 0.20004. On the other hand, the probability for drawing a red ball with replacement of the first ball is 0.2, which is very close to both values computed assuming no replacement. Probability distributions A lot of problems follow the same scheme when computing probabilities, although they are completely different in their nature. Three examples are: • A number of n individuals not related to each other suffer from an infection. The probability p of an asymptomatic infection is known. What is the probability that exactly x (or equal to or less than x) infections are asymptomatic? • An experienced surgeon operates on n not related patients on different days. The probability p for a complication during the routine operation is known. What is the probability that a complication will occur in exactly x (or in x or less) patients? • In total n not related patients suffering from mushroom poisoning after eating death cap are admitted into the Vienna General Hospital. Let p denote the probability that a patient survives such a poisoning. What is the probability that exactly x (or at least x) patients survive the poisoning? In order to compute these probabilities it is not necessary to reinvent the wheel whensoever. All these questions raised above can be answered by 86 22B3.2. Probability theory making use of the so-called binomial distribution. The probability for n=18 and a “success” probability of p=0.35 have been tabled on the next page. By a “success”, we mean “asymptomatic infection”, “complication” or “survival after poisoning”. Question: Why do we have to pay special attention to phrasings like “experienced surgeon”, “not related individuals/patients”, “on different days”? x ... Number of successes p … Success probability Cumulative probability 0 0.00043 0.00043 1 0.00416 0.00459 2 0.01903 0.02362 3 0.05465 0.07827 4 0.11035 0.18862 5 0.16638 0.35500 6 0.19411 0.54910 7 0.17918 0.72828 8 0.13266 0.86094 9 0.07937 0.94031 10 0.03846 0.97877 11 0.01506 0.99383 12 0.00473 0.99856 13 0.00118 0.99974 14 0.00023 0.99996 15 0.00003 0.99999... 16 0.00000... 0.99999... 17 0.00000... 0.99999... 18 0.00000... 1.00000 How to read this table: • If the success probability is 35%, then the probability of exactly 7 successes in 18 trials is 17.9%. • The probability of at most 3 successes in 18 trials, is 7.83%. • The probability of at least 10 successes in 18 trials, is 5.97% (Note: computed via the complementary probability. The probability of at 87 22B3.2. Probability theory most 9 successes is 94.03%, thus the probability of at least 10 successes is 100% - 94.03% = 5.97%.). The table can also be visualized by a diagram showing the probability function: Probability function Binomial distribution 0.20 0.19 0.18 0.17 0.16 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.00 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Number of successes Please note that though in the constellation of our example it is possible to yield 15 or more successes in 18 trials, this is a highly implausible event. If we had succeeded in 17 out of 18 trials, then there are three potential explanations: a) this is a truly fortunate coincidence b) the success probability assumed to be 35% is wrong c) other assumptions in our computations do not hold, e. g., the trials are not independent from each other By the way, we will revisit these considerations at various other occasions during the course. The principle of statistical testing is based on these central and basic considerations. 88 22B3.2. Probability theory Cumulative probabilities are often called distribution function in statistics. (Note for students who already came across with Kaplan-Meier-curves: these are kind of reverse distribution functions.) Binomial distribution 1.0 Distribution function 0.8 0.6 0.4 0.2 0.0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Number of successes The binomial distribution is a discrete distribution. Other discrete distributions comprise: • • • • • • • Binomial distribution: how probable are x successes in n trials, keeping the success probability p constant Hypergeometric distribution: how probable are x successes in n trials if drawn without replacement (lottery) Multinomial distribution: generalization of binomial distribution to more than two possible conditions Poisson distribution: describes very rare events (e. g., in epidemiology) Geometric distribution: how many trials n are needed to yield the first success, with constant success probability p Negative binomial distribution: how many trials n are needed to yield x successes with constant success probability p Discrete uniform distribution: e. g., rolling a fair dice Beneath discrete distributions there is the group of continuous distributions. We can apply a continuous distribution for the sleeping 89 22B3.2. Probability theory duration example, in which the point probability of a particular sleeping duration is of no importance at all; nobody cares about the probability that somebody sleeps exactly 8 hours 34 minutes 17.2348234230980901 seconds. We are only interested in probabilities for intervals, e. g., what is the probability that somebody • • sleeps longer than 11 hours? sleeps between 8 and 9.5 hours? With continuous distributions, there is no probability function, because on the one hand there are so many individual points, and on the other hand each of these points is assigned an infinitely small probability (“virtually zero”). Moreover, only probabilities for intervals are of interest (see above). Note: The transition from discrete to continuous distributions is one small step for us. We do not have to bother about the fact that it is one giant leap for mathematicians dealing with measure theory. The analogue to the probability function with continuous distributions is the so-called density function. Probabilities assigned to intervals are derived by computing the area under the density function. The cumulative density function is again called distribution function. The following diagrams show the distribution and density functions for the most important continuous distribution, the normal distribution. Distribution function of the normal (Gaussian) distribution 1 0,5 0 90 22B3.2. Probability theory Density of the normal (Gaussian) distribution The normal distribution is sometimes also called Gaussian distribution (Carl Friedrich Gauß, a German mathematician, 1777-1855). There are several other distributions beside the normal distribution. Among them the best-known are the t-distribution, the chi-squaredistribution, the F-distribution, the gamma distribution, the lognormal distribution, the exponential distribution, the Weibull distribution, the continuous uniform distribution and the beta distribution. Each of them has (similarly to discrete distributions) its classical applications. Later, we will make use of t-, chi-square- and F-distributions. Why is the normal distribution of special importance? • The results of many experiments in biological science are – at least approximately – normally distributed. • The distribution of the mean and the sum of random variables approaches the normal distribution with increasing sample size (the central limit theorem). • Other known distributions stem from the normal distribution. As an example, the chi-square-distribution can be seen as the distribution of the sum of squared normally distributed random variables. Note: the denomination “normal distribution” doesn’t mean that this distribution is the normal case, the most common distribution, or a kind of standard. Relationship between probability theory and descriptive statistics Density function and probability function are strongly related to histogram and bar chart, respectively. While these functions describe theoretical 91 22B3.2. Probability theory thoughts or the population, histogram and bar chart are used to describe sample results. The following already known concepts can be used for all theoretical distributions, irrespective if discrete or continuous: • Expectation: this is the theoretical mean, the mean of the population, the “true” mean • Theoretical variance: the expected mean square deviation from the expectation, the “true” variance • These considerations also hold for statistics like median, other quantiles, the skewness, etc. The most important difference between an expectation and an empirically determined mean is the following: the expectation is the true value in a population, it is fixed and immovable. The empirical mean is a value computed from random results (from a sample), thus it is a consequence of randomness and possesses a probability distribution which can be studied. The same holds for the comparison of all other theoretical and empirical statistical measures. The law of large numbers and the central limit theorem Now let’s consider two important basics of probability theory which concern the mean computed from a sample. To keep it simple, let’s assume the sample consists of independent identically distributed random numbers. The law of large numbers: The empirical mean computed from a sample approaches the expectation (the population mean) with increasing sample size. By the way it should be noted that the theoretical variation of the empirical mean decreases with increasing sample size n (exactly by a factor of n ). Intuitively, the law of large numbers is easy to understand: the more we measure, i. e., the larger the sample, the more exact and precise are our results. Example: In a scientific meeting two high-quality studies concerning the same topic are presented. The only difference lies in the sample size: 92 22B3.2. Probability theory while in study A 30 patients had been recruited, study B is based on 200 patients. By nature we will consider the results of study B as more plausible than those of study A, just because it is more probable that the larger study B is closer to the truth than study A The central limit theorem: The distribution of the empirical mean converges to a normal distribution with increasing sample size 93 22B3.2. Probability theory This statement is very useful for various statistical methods. However, it is difficult to follow intuitively. For students interested in the topic it is recommended to verify the central limit theorem by dice experiments or computer simulation. There are some Java-applets around demonstrating the central limit theorem: http://medweb.unimuenster.de/institute/imib/lehre/skripte/biomathe/bio/grza2.html http://statistik.wuwien.ac.at/mathstat/hatz/vo/applets/cenlimit/cenlim.html Summary Probability theory enables us to draw conclusions about samples, given we know the truth (the population characteristics). And we are able to evaluate “what if” scenarios. We will often come back to that possibility later in the course. Our next goal In empirical clinical research we don’t know the truth (the population), otherwise we wouldn’t have to do research. Usually only a sample limited in size is available, which is used to draw conclusions about the characteristics of the population. Since the sample size is limited we cannot draw conclusions with absolute certainty, but we can make use of probability theory and “what-if” scenarios to quantify the inevitably inherent uncertainty. 94 23B3.3. Exercises 3.3. Exercises 3.3.1. Consider a delivery of goulash tins. Why doesn’t a complete examination of all goulash tins make sense? Name medical examples where a complete examination is not reasonable. 3.3.2. Assume that among all goulash tins there are exactly five with spoilt contents. A sample of 20 is drawn, among which the five spoilt tins are found. Is this sample still representative for the population of all goulash tins? 3.3.3. Name some conditions under which the patient selection in a clinical study yields a representative sample. 3.3.4. Sometimes probabilities are indicated as odds. In opposite to the classical definition of probabilities, where a ratio of “favorable” and “possible” events is computed, odds are defined as the ratio of “favorable” and “unfavorable” events. Thus, odds can assume any numbers greater than or equal to zero, not bounded above. As an example, consider the odds for a male birth: they are 0.514/0.486 = 1.0576. Compute the odds for a female birth. Assume that prior to the beginning of the European Soccer Championship 2008 in Austria and Switzerland, the odds that Switzerland will win the games is 2:7. What is the probability of a European Soccer Champion Switzerland? 3.3.5. A newly developed diagnostic test for HIV discovers 98% of the actually infected patients. On the other hand, it will also classify 5% of persons not infected by HIV as infected. Compute sensitivity and specificity of this test. Compute the positive and the negative predictive value of the test. 3.3.6. Are in general two mutually exclusive events independent from each other? (Try to solve by finding an appropriate example.) 3.3.7. During a scientific meeting a visit of a casino is offered. At the roulette table you meet an old friend which seems to linger there already for some time. He is very excited to report that the roulette ball has fallen on a red number for eight times in series. Therefore, he suggests to bet on “black”, because it is much more probable now. He does so and the ball really lands on a black number. What do you think about that? 95 23B3.3. Exercises 3.3.8. (from Gigerenzer, 2002) A Bavarian minister of the interior spoke out on the dangers of drug abuse and explained that since most of the heroin addicted have been using Marihuana before, it follows that most Marihuana users will become heroin addicted later. Appraise the conclusion of the Bavarian minister of the interior from a probability theory view. (Instructions: Define the corresponding events, identify the conditional probabilities used in the conclusion. Use Venn diagrams.) 3.3.9. The course of infectious diseases may be asymptomatic. Assume the probability of an asymptomatic course of a particular disease is 40%. Further assume that three persons have been infected. What is the probability for (a) three asymptomatic infections, (b) two asymptomatic infections and one symptomatic infection, (c) one asymptomatic infection and two symptomatic infections, (d) three symptomatic infections? Assume you happen to know that the three infected persons are closely related. Could this information affect the validity of the computations of (a) to (d)? 3.3.10. Assume the Hardy-Weinberg-conditions apply. Compute the phenotype probabilities of the blood groups in a Caucasian population. What is the probability that in this population two randomly selected persons have the same blood group (the same phenotype)? What is the probability that in this population a biologic pair of parents (both with phenotype A) have an offspring with phenotype O? What is the probability that in this population a biologic pair of parents (both with phenotype O) have an offspring with phenotype A? 3.3.11. What are the probabilities in a lottery with 45 numbers (e. g., “Lotto 6 aus 45”) to have 0, 1, 2, 3, 4, 5, 6 out of 6? Also compute the probability for 5 out of 6 considering the bonus ball (“Zusatzzahl”; this means that you bet on 5 of the 6 winning numbers and on the bonus ball). 96 23B3.3. Exercises 3.3.12. Assume that two therapies A (the standard treatment) and B (a new treatment) exist to treat a particular type of cold. Treatment by A yields a success rate of pA=24%, and treatment by B yields pB=26%. The number needed to treat (NNT) to describe an advantage of B over A is defined as follows: NNT = 1/(pB-pA) = 1/(0.26-0.24) = 1/0.02 = 50 (Please note that percentages have to be given as proportions!) Can you interpret this result? What would happen if pA=26% and pB=24%? Compute the NNT for pA=3.5% and pB=16.1%. Interpret the result. Assume treatment A would always and B never lead to a success. Compute and interpret the NNT. Assume both treatments have the same success rate. Compute and interpret the NNT. A colleague tells you that the NNT of Acetylsalicyclic acid (Aspirin) in treating headache is 4.2 patients. Are you contented with this proposition? 97 24B4.1. Principle of statistical tests Chapter 4 Statistical Tests I 4.1. Principle of statistical tests Example 4.1.1.: (scale variable) Animal experiment of Dr. X: two groups of female rats receive food with high and low protein Research question: Is there a difference between groups for the weight gain (from an age of 28 to 84 days)? Result (weight gain in gram): Group 1 (high protein): 134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123 Group 2 (low protein): 70, 118, 101, 85, 107, 132, 94 Analysis from Dr. X: mean weight gain in group 1 (high protein) 120 g mean weight gain in group 2 (low protein) 101 g Conclusion from Dr. X: High protein in the food of young female rats results in higher weight gain than food with low protein. (Generalization from the sample to the population!) Pretension of colleague Y: Differences are due to CHANCE allone! In reality both diets cause identical weight gains. The conclusion above is worthless. 98 24B4.1. Principle of statistical tests On the one hand: Colleague Y could be right! Are the results pure chance? On the other hand: Every conclusion can be doubted with the argument „everything is chance“! What to do? Conclusion: If we want to come to a decision then the possibility of a false positive answer has to be accepted! The statisticans call this case type I error or α-error. Concretely in our example: If we draw the conclusion that food with different protein-contents leads to different weight gains, then this decision could be a wrong decision. This is an inevitable fact. In other words: If we come to a positive answer in an empirical study, then we have to live with the uncertainty that it could be a false positive answer! So, clever colleagues could decleare every unimportant result as empirical proof for their research hypothesis and counter criticism with „I always accept the possibility of a false positive answer!”. Consequently one has to conclude, that a potential false-positive decision should not be made arbitrarily often and also not to one’s customized disposal. If there is the possibility for a false-positive answer then such an answer should • • be rare be made only in situations, where the „pure chance“-argument seems to be implausible due to the observed data. The considerations above can be formalised, and a statistical test results 99 24B4.1. Principle of statistical tests Principle of a statistical test: • Define a null hypothesis, usually: „There is no difference!“ In example 4.1.1. the null hypothesis corresponds with the view of the critical colleague Y. Remark: The null hypothesis if often the negation of the research question. • Alternative hypothesis: „There is a difference!“ Of course, this is a very broad statement. But initially we are satisfied with this. • Assume the null hypothesis applies. Calculate the probability to observe the actual (or a more extreme) result. This probability is called p-value. Nowadays, it is calculated nearly exclusively by computer programs. Remind: the p-value does not correspond with the probability that the null hypothesis is true. The p-value can rather be seen as conditional probability, where the condition is the assumed validity of the null hypothesis. Remark: A mathematical statistician will be against the statement „the pvalue is a conditional probability“ in the same way as a phycisian is against the description of a broken fibula as a „broken foot”. Both statements are acceptable in a broad sense for non-medic people, but incorrect for professional terminology. Only for interested people (and only for them): The p-value is based on a restriction of the parameter space and not on a restriction of the sample space (as it would be necessary for the definition of a conditional probability). • If the calculated p-value is smaller than or equal to 0.05, then the null hypothesis is rejected and the result is called “statistically significant”. The limit of 5 % is called significance level and often the greek letter α is used as its abbreviation. Remark: Every value between 0 and 1 could theoretically be used as significance level but only small values like 0.01, 0.05 or 0.10 are useful. In medicine, the value 0.05 has been established as a standard. 100 24B4.1. Principle of statistical tests A statistical test corresponds to a „what-if“ scenario: 1. What if the null hypothesis was true? 2. Can our data be explained in a plausible way? 3. The p-value measures the degree of plausibility. 4. If the p-value is „small“ then the null hypothesis is not a plausible explanation for our observed data. 5. There „must be“ another explanation (alternative hypothesis!) and we „reject” the null hypothesis. 101 25B4.2. t-test 4.2. t-test Back to the rat diet example 4.1.1.: If a p-value was smaller than 5%, then Dr. X could claim that the variation in the protein-diet may lead to differences in the weight gain. The objection of the sceptical colleague Y would be disproved. Remark: In reality colleague Y could be right anyway, but Dr. X could “claim” his observed differences by a significant result. The messages above are in subjunctive, as they are only valid in case of a significant result. How can we now detect a potential significant difference in the rat diet example? One possibility is offered by the t-test for independent samples (also called unpaired t-test). The null hypothesis is: „No difference in weight gain between the two protein-diets.“ Calculation Assume that the null hypothesis is true. Calculate the probability to observe the current or a more extreme result. What is a "more extreme result“? Every result farther away from the null hypothesis than the current result. Wanted: We need an intuitive measure to detect deviations from the null hypothesis. A suitable measure appears to be the difference of the means between the two groups. The bigger the difference between the two means the more exteme is the result. Or in other words: the further we are away from the null hypothesis. Concretely, for example 4.1.1.: Consider that the two diets do not cause unequal weight gains. Calculate the probability to observe an absolute mean difference in the weights of 19 gramms or higher by chance (both directions are interesting – two sided test). 102 25B4.2. t-test If • data approximately follow a normal distribution, and • if standard deviations within the groups are approximately equal, then the mean difference divided by its estimated variation („standard error“) follows a t-distribuion under the null hypothesis. This quotient is also named test statistic. Why is the observed difference in means divided by its variation? First, to obtain a general measure without dimension and second, to take into acount that the relative importance of a difference of e.g. 19 grams depends on the population variation. If the conditions mentioned above apply then we can use the t-distribution to calculate the p-value. Nowadays, this is nearly exclusively performed by means of computer programs. Use of SPSS Now we can use SPSS to analyze the example. I.) First, the data have to be entered II.) Then we visualize the situation by graphical and/or other descriptive tools III.) By means of the results from II) we verify the assumptions of the intended test procedure IV.) Only after having verified the assumptions we calculate the test V.) We interprete the results ad I.) For the rat diet example 4.1.1. we generate two variables, GROUP and GAIN: GROUP is a binary variable to discriminate the protein content in the food of the rats (1=high protein, 2=low protein) GAIN is a scale variable containing the weight gain in grams 103 25B4.2. t-test ad II.) Boxplots are used for visualization. ad III.) In both groups, the weight gain is symmetrically distributed without any outliers. The spread (standard deviations) is approximately equal in both groups. The usage of the planned t-test is justified. ad IV.) To compute the t-test we choose the menu Analyze Compare Means Independent-Samples T Test... We move the variable GAIN to the field Test Variable(s): and the variable GROUP to the field Grouping Variable: Beneath the GROUP variable two question marks appear. We click on the button Define Groups... 104 25B4.2. t-test and enter the value 1 into the field Group 1: and the value 2 into the field Group 2: These are the group codes discriminating low and high protein groups. Then we click on Continue and OK to calculate the requested t-test. First we receive the descriptive measures per group. Group Statistics Weight gain (day 28 to 84) Dietary group high protein diet low protein diet N Mean Std. Deviation Std. Error Mean 12 120.000 21.3882 6.1742 7 101.000 20.6236 7.7950 The test result is shown in a very wide and thus confusing table, which was splitted into three parts here for easier understanding. Important parts have been highlighted by grey background. Independent Samples Test Levene's Test for Equality of Variances F Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed .015 105 Sig. .905 25B4.2. t-test Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed t-test for Equality of Means Sig. (2Mean df tailed) Difference t 1.891 17 .076 19.0000 1.911 13.082 .078 19.0000 Std. Error Difference Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed 95% Confidence Interval of the Difference Lower Upper 10.0453 -2.1937 40.1937 9.9440 -2.4691 40.4691 The "Mean Difference" of 19 divided by its variation ("standard error difference") of 10.0453 results in the observed test statistic value "t" of 1.891. As the test statistic follows a t-distribution with 17 degrees of freedom (“df”) under the null hypothesis, the p-value (“Sig. (2-tailed)“) calculates to 0.076. ad V.) Using the common significance level of 5 % it follows that the null hypothesis of equality of the means cannot be rejected. Dr. X cannot invalidate the „pure chance“ objection of Dr. Y! 106 26B4.3. Wilcoxon rank-sum test 4.3. Wilcoxon rank-sum test The Wilcoxon rank-sum test is a non-parametric test. It is equivalent to the Mann-Whitney U-test. Thus, it is also often called Wilcoxon-MannWhitney U-Test. We will get acquainted with this test by applying it to the data of example 4.1.1. Null hypothesis is defined: „No difference in weight gain between the two protein diets.“ Calculation: Consider the null hypothesis is true. Calculate the probability to observe the actual result (or a more extreme result than the actual one). Wanted: a measure by which „extreme results“ can be defined, e.g. the rank-sum of the smaller group (this is group 2 with only 7 rats for the rat diet example). Excursion: Calculation of the rank-sums weight ranks of rank gain group 1 (smallest value) 1 70 2 83 2 3 85 4 94 5 97 5 6 101 7 104 7 8.5 107 8.5 8.5 107 10 113 10 11 118 12 119 12 13 123 13 14 124 14 15 129 15 16 132 17 134 17 18 146 18 (highest value) 19 161 19 Σ 190 Σ 140.5 107 ranks of group 2 1 3 4 6 8.5 11 16 Σ 49.5 26B4.3. Wilcoxon rank-sum test What is a more extreme example? The total rank-sum is 190. The mean rank per rat is 10. The mean rank-sum of 7 rats should be 70. The greater the deviation from 70, the more extreme is the result. Concretely: Assume there is no difference between the diet groups and calculate the probability to observe a rank-sum for 7 rats of 49.5 or smaller or of 90.5 (= 70 + (70-49.5)) or higher (again a two-sided test). To perform the Wilcoxon-Mann-Whitney U-test with SPSS, we have to click on Analyze Nonparametric Tests 2 Independent Samples... We move the variable GAIN to the field Test Variable List: and the variable GROUP to the field Grouping Variable: Besides the GROUP variable two question marks appear. We click on the button Define Groups... and enter the value 1 into the field Group 1: and the value 2 into the fied Group 2: These are the group codes discriminating low and high protein groups. Then we click on Continue In the box Test Type we choose Mann-Whitney U. Finally we click on OK and the requested Wilcoxon-Mann-Whitney U-Test will be calculated. 108 26B4.3. Wilcoxon rank-sum test First, we receive descriptive measures of the ranks per group. Ranks Weight gain (day 28 to 84) Dietary group high protein diet low protein diet Total Mean Rank N Sum of Ranks 12 11.71 140.50 7 7.07 49.50 19 The next table contains the test reuslts. Again, important parts have grey background. Test Statisticsb Mann-Whitney U Wilcoxon W Z Asymp. Sig. (2-tailed) Exact Sig. [2*(1-tailed Sig.)] Weight gain (day 28 to 84) 21.500 49.500 -1.733 .083 .083a a Not corrected for ties. b Grouping Variable: Dietary group The calculated p-value („Asymp. Sig. (2-tailed)“) of 0.083 indicates no significant difference, similarly to the t-test. The advantage of the Wilcoxon-Mann-Whitney U-tests is that it can be used even if the assumptions of the t-test do not apply. On the other hand, if the assumptions of the t-test are valid (like in the rat diet example) then the t-test should be preferred to the Wilcoxon-MannWhitney U-test. A simple decision rule is: If means properly describe the location of the distribution in both groups then the use of the t-test is justifiable, as it is based on the comparison of the means. 109 26B4.3. Wilcoxon rank-sum test A rule of thumb warns for problems with the Wilcoxon-Mann-Whitney Utest. If one or both groups have less than 10 observations, then the asymptotic p-values („Asymp. Sig. (2-tailed)“) can become unprecise. In such situations, it is recommended to calculate “exact” p-values by using the permutation distribution. This rule of thumb is relevant for our example, as the smaller group contains only 7 rats. SPSS calculates the exact p-value for the Wilcoxon-Mann-Whitney U-test automatically (“Exact Sig. [2*(1-tailed Sig.)]“). In our example the exact pvalue of 0.083 is identical to the asymptotic p-value. However, the calculation of exact p-values is computationally difficult. The exact p-value calculated by SPSS in its standard version is not quite correct in case of ties in the data. SPSS refers in footnote "a" explicitly to this circumstance. Remark: The statement that “the exact p-value is not quite correct” is – admittedly – puzzling. This is due to the fact that the term “exact” only refers to the use of the permutation distribution to compute the p-value. Obviously, the term “exact” does not refer to the accurateness of this computation. As there are ties in our example (the value of 107 gramms appears in both groups) the question emerges how to calculate a correct exact pvalue? To perform a correct exact Wilcoxon-Mann-Whitney-U-test with SPSS, we have to click on Analyze Nonparametric Tests 2 Independent Samples... And fill in the fields as before. Click on the button Exact... and choose the option Exact. 110 26B4.3. Wilcoxon rank-sum test Test Statisticsb Weight gain (day 28 to 84) Mann-Whitney U 21,500 Wilcoxon W 49,500 Z -1,733 Asymp. Sig. (2-tailed) ,083 ,083a ,087 Exact Sig. [2*(1-tailed Sig.)] Exact Sig. (2-tailed) Exact Sig. (1-tailed) ,044 Point Probability ,004 a. Not corrected for ties. b. Grouping Variable: Dietary group With this option the exact p-value (“Exact Sig. (2-tailed)“) calculates to 0.087. This p-value differs only slightly from the asymptotical p-value. Remark: An exact p-value often is the only option in case of small sample sizes. Unfortunately, the term "exact" suggests this is a more “preferable” p-value. Thereby, a disadvantage of exact tests is often not realized: they are in general conservative, i.e. the specified error probability (of e.g. 5 per cent) is often not completely utilized (loosely speaking, “exact p-values are often too large”). So, in the case of the Wilcoxon rank sum test: if both groups have at least 10 observations, then we can use the asymptotic p-values. 111 27B4.4. Excercises 4.4. Excercises 4.4.1. The 24-hours total energy consumption (in MJ/day) was determined for 13 skinny und 9 heavily overweighted women. Values for skinny women: 6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88 Values for heavy overweighted women: 8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79 a.) Is there a difference in energy consumption between both groups? b.) You can find the study data in the data file b4_4_1.sav. An error occured at the setup of the data file? Which one? 4.4.2. You can find the data for the rat diet example in the data file b4_1_1.sav. Change the highest value in group 1 to 450 gramms. (a) Perform a t-test (b) Perform a Wilcoxon-Mann-Whitney U-test (c) Compare and comment the results 4.4.3. Would it be useful to determine several characteristics of the participants of this lecture (e.g. height in cm, age in years, already graduated yes/no) and then to test differences between men and women? 112 28B5.1. More about independent samples t-test Chapter 5 Statistical Tests II 5.1. More about independent samples t-test Revisiting example 4.1.1.: (scale variable) Animal experiment of Dr. X: Two groups of female rats receive food with high and low protein, respectively. Research question: Are there differences in weight gain (from 28. to 84. day)? The boxplot shows similar spread in both groups: 113 28B5.1. More about independent samples t-test If we calculate the t-test with SPSS then the first table that is shown in the output is a table with descriptive measures. There we can see that both groups have very similar variation (standard deviations of 21.39 and 20.62). Group Statistics Weight gain (day 28 to 84) Dietary group high protein diet N 12 Mean 120.000 Std. Deviation 21.3882 low protein diet 7 101.000 20.6236 Std. Error Mean 6.1742 7.7950 Interestingly, the result of the t-test is given in two rows, labelled „Equal variances assumed“ and “Equal variances not assumed”. As the standard deviations (and thus also the variances) are very similar in our example, we assume that differences (in standard deviations) are due only to chance and use the result of the first row. Independent Samples Test Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed t-test for Equality of Means Sig. (2Mean df tailed) Difference t 1.891 17 .076 19.0000 1.911 13.082 .078 19.0000 Independent Samples Test Std. Error Difference Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed 114 95% Confidence Interval of the Difference Lower Upper 10.0453 -2.1937 40.1937 9.9440 -2.4691 40.4691 28B5.1. More about independent samples t-test Why do we have to differentiate between a t-test with equal and a t-test with unequal variances We already know: the „mean difference” of 19 divided by its standard error of 10.045 results in the observed test statistic "t" of 1.891. If the null hypothesis is valid then this test statistic follows a t-distribution with 17 degrees of freedom ("df"). Using this information we can calculate the p-value. However, this is only valid in case of equal variances in the underlying populations. In case of unequal variances there is a problem. This is already known for a very long period and called Behrens-Fisher-problem. Briefly, the variation of the mean difference (“standard error of the difference”) and the degrees of freedom ("df") for the t-distribution have to be corrected. Thus we obtain a different p-value. When should we use the t-test for equal variances and when should we use the t-test for unequal variances? In principle, the t-test is robust against small to moderate deviations of its assumptions. A very rough rule of thumb to the question above states that if the standard deviations are different by a factor of 3 or higher then the version of the t-test with unequal variances should be used. However, the problem of equal/unequal variances strongly depends on the sample size. In case of large sample size it becomes more and more negligible. Besides this imprecise rule of thumb there is another option. Some of you may have thought: couldn’t variances be tested for equality using a statistical test? Yes, they could. SPSS uses the so-called Levene-test for Equality of Variances Null hypothesis: the two samples are drawn from populations having equal variances, or from the same population Alternative hypothesis: the variances are unequal We already know: A sensible measure to determine the deviation from the null hypothesis is required. SPSS uses the Levene-statistic (we will not discuss it in detail). Under the condition that the null hypothesis (equality of variances) is true, the probability can be calculated to observe the actual value or a more extreme value of the Levene-statistic. This probability is the p-value. 115 28B5.1. More about independent samples t-test SPSS automatically performs the Levene-test together with the t-test. Independent Samples Test Levene's Test for Equality of Variances Weight gain (day 28 to 84) Equal variances assumed Equal variances not assumed F Sig. .015 .905 For the Levene-test the p-value calculates to 0.9 for the rat diet example. The null hypothesis of equality of variances cannot be rejected, but this does not necessarily mean that the variances of the underlying populations are really equal. This is the big disadvantage of this test: In case of small sample size, where the test result would be very important for us, even huge differences in the variances result in an insignificant test. Whereas in case of large sample size, where different variances do not cause a problem any more, even very small and unimportant differences in the variances may become statistically significant. Thus: Use graphs to visualize the data. If there is uncertainty about the equality of the variances, then use the ttest for unequal variances. Remark: Other software packages (also earlier versions of SPSS) often use an F-test to test equality of variances instead of the LeveneTest. The F-test would give a p-value of 0.9788 for the rat diet example 4.1.1.. For this F-test the argument mentioned above also applies. What if the data in the groups are not normally distributed? We know that in this case an important assumption for the t-test is not valid. The Wilcoxon-Mann-Whitney U-test offers a useful alternative. 116 28B5.1. More about independent samples t-test We should not abandon the t-test too quickly, particularly if the data can simply be transformed into an approximate normal distribution. E. g., a logarithmic transformation often turns right-skewed distributions into normal. But caution! • Data transformations change the interpretation of the results. • Not always are data transformations successful. Association between independent samples t-test and analysis of variance This is an anticipation of a later chapter: In many cases we would like to compare not only two but more than two groups. This can be performed by an analysis of variance (thus also abbreviated by ANOVA). The t-test is the simplest version of the one-factorial (=one-way) analysis of variance. • The "factor" is an interesting characteristic measured on a nominal scale (e. g., a factor defining experimental groups). • "one-factorial" characteristic. means that we are only interested in one For illustration, we revisit the rat diet example 4.1.1. In order to perform a one-factorial (=one-way) ANOVA, we click on Analyze Compare Means One-Way ANOVA We move the variable GAIN to the field Dependent List: and the variable GROUP to the field Factor Notice: This time no question marks appear. This method is not restricted to two groups. We click on OK and the requested one-way ANOVA will be computed. 117 28B5.1. More about independent samples t-test ANOVA Weight gain (day 28 to 84) Sum of Squares Between Groups 1596.000 Within Groups 7584.000 Total 9180.000 df Mean Square 1 1596.000 17 F 3.578 Sig. .076 446.118 18 Besides incomprehensible items, we can find others which are already known. The p-value (“Sig.”) of 0.076 is the same as the p-value for the ttest. Also the degrees of freedom (“df”) with a value of 17 are familiar to us. If we draw the square root of the “F”-value of 3.578 we obtain the value of the test statistic “t” from the t-test, namely 1.891. Thus, an ANOVA with a factor with two levels corresponds to the independent samples t-test with equal variances. We can consider an analysis of variance as generalization of the t-test. Note that the same problems observed for the t-test also apply to ANOVA! (Remark: And others come along …) 118 29B5.2. Chi-Square Test 5.2. Chi-Square Test Example 5.2.1.: (binary outcome) Therapy comparison by Dr. X: standard therapy will be compared with new therapy, the outcome is binary (cured versus not cured) Research question: Is there a difference in the cure rates between therapies? cured yes no standard therapy 4 12 16 new therapy 9 9 18 13 21 34 standard therapy 25 % success new therapy 50 % success Conclusion of Dr. X: The new therapy is better than the standard therapy! Claim of colleague Y (obviously an unpleasant and jealous person): This result is due to pure chance, in reality the success rates are equal in both therapy groups! Again, the question arises: What now? Obviously we have to use a statistical test again. But the t-test and the Wilcoxon rank-sum test are not appropriate here, as the outcome is binary. 119 29B5.2. Chi-Square Test Instead, the Chi-squared test can be used to compare two groups with binary outcomes. Null hypothesis: „No difference in the cure rates between the two groups“ Calculation: Assume that the null hypothesis is true. Calculate the probability to observe the actual result (or even a more extreme result). Wanted: An intuitive measure to determine „more extreme results“ is needed, e.g. Pearsons Chi-square criteria. (Basic idea: use the squared differences between the actually observed result and the result expected if the null hypothesis was true.) If the null hypothesis applies: In total, there are 13 cured out of 34 cases or 38.2 % so (if the null hypothesis was true): • 38.2 % expected cured of 16 cases under standard therapy are 6.1 cured • 38.2 % expected cured of 18 cases under new therapy are 6.9 cured expected under null hypothesis cured yes no standard therapy 6.1 9.9 16 new therapy 6.9 11.1 18 13 21 34 The observed values are contrasted to the expected values under the null hypothesis by the Perarson Chi-square criteria. 120 29B5.2. Chi-Square Test cured yes standard therapy no 4 12 6.1 9 new therapy 9.9 9 6.9 11.1 Table: The observed values are given in the upper left corners of each cell and the expected values under the null hypothesis are given in the lower right corners. The Pearson Chi-square criteria is calculated as follwos: ( 4 − 6.1)2 + (12 − 9.9 )2 + ( 9 − 6.9 )2 + ( 9 − 11.1)2 6.1 9.9 6.9 11.1 = 2.2 Idea behind: 1.) observed minus expected (the bigger the differences the worse the agreement with the null hypothesis). 2.) square the differences (so that bigger differences are „penalized“ more). 3.) scale with expected values (so that differences at a high level count less than the same differences at a low level). Example to illustrate the idea behind: A difference of 7 with an expected value of 1000 is relatively small compared to a difference of 7 with an expected value of 20. 4.) Sum over all four cells. What is an extreme result? The higher the value of the Chi-square criteria the more extreme is the result. Concretely: Assume there is no difference between therapies. Calculate the probability to observe a Chi-square value of 2.2 or higher. To calculate the example 5.2.1 with SPSS, we create two binary variables, THERAPY and CURED: • THERAPY differentiates between therapy groups (0=standard therapy, 1=new therapy) • CURED contains the outcome, if a cure has occured (0=no, 1=yes) 121 29B5.2. Chi-Square Test To perform the Chi-squared test we click on Analyze Descriptive Statistics Crosstabs... We move the variable THERAPY to the field Row(s): and the variable CURED to the field Column(s): We click on the button Statistics... and choose Chi-square. Then we click on Continue and OK and the requested Chi-square test will be calculated. First, information about the number of valid cases is given in the output. Therapy * cure chances Case Processing Summary Cases Valid Missing N Percent N Percent 34 100,0% 0 ,0% Total N Percent 34 100,0% Then the 2×2-table is shown, where the therapy and the cures are crosstabulated. Therapy * cure chances Crosstabulation Count cured no Therapy standard new Total Total yes 12 4 16 9 9 18 21 13 34 122 29B5.2. Chi-Square Test The last table contains the requested results of the Chi-squared test. Important parts have a grey background. Chi-Square Tests Value Asymp. Sig. (2-sided) df 2,242a 1 ,134 Continuity Correctionb 1,308 1 ,253 Likelihood Ratio 2,286 1 ,131 Pearson Chi-Square Fisher's Exact Test Exact Sig. (2-sided) Exact Sig. (1-sided) ,172 Linear-by-Linear Association 2,176 N of Valid Cases 34 1 ,126 ,140 a 0 cells (,0%) have expected count less than 5. The minimum expected count is 6,12. b Computed only for a 2x2 table The Pearson Chi-square criteria is a test statistic that asymptotically follows a Chi-square distribution with one degree of freedom ("df") for a 2×2-crosstable under the null hypothesis. For the observed test statistic value (“Value”) of 2.242 the p-value ("Asymp. Sig. (2-sided)") calculates to 0.134. This means: Again the objection of colleague Y cannot be invalidated! A rule of thumb points to problems of the Chi-square test. If in one of the four cells the expected frequency under the null hypothesis is smaller than 5, then the so-called Fisher’s Exact test should be used instead. This rule of thumb applies in our example. SPSS refers to this rule of thumb with footnote "b". More about Fisher's exact test can be found in the Appendix about exact tests. 123 30B5.3. Paired Tests 5.3. Paired Tests Example 5.3.1.: The daily dietary intake of 11 young and healthy women was measured over a longer period. The women didn’t know the aim of the study in advance, which was the comparison of pre- and post-menstrual ingestion, to avoid any deliberate influence on the study results. The mean dietary intake (in kJ) over 10 pre- (PREMENS) and 10 post-menstrual days (POSTMENS) of each woman is given in the following table: WOMAN 1 2 3 4 5 6 7 8 9 10 11 PREMENS 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770 POSTMENS 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335 The research question is: Is there a difference between pre- and postmenstrual dietary intake? First idea: Use t-test or Wilcoxon-Mann-Whitney U-test Second idea (or rather a question): Is this an appropriate setting for the situation? Contrary to the previous 2-group-comparisons, we now deal with two measurements per individual. Thus we have a situation with dependencies. We call this a paired situation. Two-group-comparisons with only one measurement per individual are called unpaired. Basically, it is possible to use the already known unpaired version of the ttest or the Wilcoxon-Mann-Whitney U-test for paired situations. However, as measurements usually are more similar within a person than measurements taken from different persons, considering the paired situation leads to higher power of the analysis. The question „Is there a difference betwen pre- and post-menstrual dietary intake?“ is equivalent to the question „Is the mean difference between pre- and postmenstrual dietary intake unequal to zero?“ 124 30B5.3. Paired Tests The suitable test is the paired t-test. We proceed as usual (according to the principle of statistical testing)! Concretely: • Null hypothesis: „The mean difference between pre- and postmenstrual food intake is equal to zero!“ (also: no effect) • Two-sided alternative hypothesis: „The mean difference is unequal to zero!“ • Intuitive measure for the deviation from the null hypothesis: absolute value of the observed mean differences • When does a result count as more extreme than the observed one? Obviously, if the absolute mean difference is higher than the observed one If the distribution of the differences approximately follows a normal distribution then the use of the paired t-test is appropriate. The mean absolute difference has to be divided by its estimated variation to receive a t-distributed value. Then the p-value can be calculated. Procedure in SPSS: First, we have to verify graphically, if the distribution of the differences is approximately normal. (To be clear: the differences PREMENS-POSTMENS should be approximately normally distributed! The distributions of the variables PREMENS and POSTMENS are not of importance here.) 125 30B5.3. Paired Tests We accept that deviations from a normal distribution are moderate. So we can perform the paired t-test. We ckick on Analyze Compare Means Paired-Samples T Test... We choose the variables PREMENS and POSTMENS and move them to the field Paired Variables: They appear as Pair 1 under Variable1 and Variable2, respectively. We click on OK and the requested paired t-test will be calculated. First, we receive a descriptive table and a table with correlation coefficients. Then we can see the result Paired Samples Test Paired Differences Mean Pair 1 Pre-menstrual dietary intake (kJ) Post-menstrual dietary intake (kJ) 1320,455 Std. Deviation Std. Error Mean 366,746 110,578 95% Confidence Interval of the Difference Lower Upper 1074,072 1566,838 Here, the mean and standard deviation of the paired differences can be seen. From the 95% confidence interval we learn that the dietary intake is statistically different during the female cycle (zero is not covered). 126 30B5.3. Paired Tests Paired Samples Test t Pair 1 Pre-menstrual dietary intake (kJ) Post-menstrual dietary intake (kJ) 11,941 df 10 Sig. (2-tailed) ,000 The p-value is <0.001, based on a two-sided test. It confirms what we have seen before from the two-sided confidence interval. Is there a non-parametric alternative to the paired t-test? There are even two, the Wilcoxon signed ranks test and the sign test. The problem with the Wilcoxon signed ranks test is, that its result may be dependent on data transformations! This is completely unususal for a nonparametric test. Thus the Wilcoxon signed ranks test should only be used if the distribution of the differences is symmetric. Some statisticians recommend the use of the sign test as non-parametric alternative to the paired t-test. However, the sign test is not vey powerful. Other statisticians recommend to transform the data if the raw data show larger differences for higher baseline values and then to apply the Wilcoxon signed ranks test. However, this is only recommendable if the transformation actually results in a symmetrical distribution of the differences. How can we verify if higher differences occur at higher baseline values? We can plot the differences (X-Y) versus the means (X+Y)/2, see the appendix on page 178. 127 30B5.3. Paired Tests To perform the Wilcoxon signed ranks test and the sign test for example 5.3.1., we click on Analyze Nonparametric Tests 2 Related Samples... We choose the variables PREMENS and POSTMENS and move them to the field Test Pairs: They appear as Pair 1 under Variable1 and Variable2, respectively. In the field Test Type we choose Wilcoxon and Sign Then we ckick on OK and the two requested tests will be calculated. For the Wilcoxon signed ranks test a p-value of 0.003 and for the sign test a p-value of 0.001 is calculated. Remarks: • Paired situations are not restricted to only two measurements within a patient, e.g. a matched case-control study also constitutes a paired situation. The individual differences will be calculated from the two observations of each matched pair (=case and its control). Caution! Matched case-control studies are easy to analyze but their dangers are great. Especially the choice of controls is a source of biased and false statements. • Paired situations are the simplest example of a more general concept, called blocking. In general every experimental unit with more than one measurement constitutes a block. The denomination block emanates from agriculture. 128 30B5.3. Paired Tests Example 5.3.2.: In 1968 a diagnostic study about vascular occlusion was performed at the 1st Department for Surgery (AKH, Vienna) with 121 patients. The diagnostic tools CLINICAL and DOPPLER were compared. The true diagnosis was assessed by venography. Here, only patients with vascular occlusion were considered. Clinical Doppler + + - + + - number of patients with vascular occlusion sum 22 3 16 3 44 In 25 cases (22 plus 3) the underlying vascular occlusion was diagnosed correctly by the clinical method. The correct diagnosis with the Doppler tool was made in 38 cases (22 plus 16). Thus, the sensitivity for the clinical diagnosis is 25/44=57 % and the sensitivity for Doppler is 38/44=86 %. Obviously, the diagnosis with Doppler is more sensitive than the clinical diagnosis. The question arises if this difference can sufficiently be explained by chance? Thus, we need again a statistical test. The structures of examples 5.3.1 and 5.3.2. are very similar. In the first example, the outcome was measured twice for each woman (pre- and post-menstrually) and in the second, the outcome is measured twice for each patient (clinical, Doppler). The main difference between both examples is the type of the outcome. In example 5.3.1. the outcome variable was scale (dietary intake in kJ) and now it is dichotomous (+, -). The correct statistical test in example 5.3.1. was nothing else than a paired version of the simple unpaired t-test. Similarly, the correct statistical test for example 5.3.2 is a paired version of the simple Chisquare test, the so-called McNemar test. The analogy continues. Only the difference between two paired measures is used for the paired t-test. The same is true for the McNemar test: the information from two paired measurements is reduced to one value. We keep in mind: the McNemar test is used for paired situations with dichotomous outcome. It is a variant of the sign test. Now the data in b5_3_2.sav are used. However, we should not forget to Weight Cases.... 129 30B5.3. Paired Tests If we cross-tabulate the diagnosis tools clinical and Doppler, then we obtain the following table: diagnosis by Doppler clinical diagnosis + total + total 3 16 (=f) 19 3 (=g) 22 25 6 38 44 In three patients, both methods lead to a negative diagnosis and in 22 patients, both methods lead to a positive diagnosis. Such identical results are called as concordant. To answer the question which diagnosis tool is more sensitive, the concordant results are unimportant. The comparison of the sensitivities 25/44 and 38/44 reduces to the comparison of the number of correct diagnoses 25 and 38. The 22 concordant results appear in both numbers, which leads finally to a comparison of the so-called discordant results, 3 and 16 (marked in the table with “g” and “f”). Thus, different diagnoses are discordant results. Only patients with discordant diagnosis results provide relevant information to answer the research question. This is the starting point of the McNemar test. We proceed as before (principle of statistical testing!) Concretely: • Null hypothesis: „The difference between both discordant results (f minus g) is equal to zero!“ (also: both diagnosis tools are equally sensitive) • Two-sided alternative hypothesis: „This difference is different from zero!“ • Intuitive measure for the distance from the null hypothesis: Chi-square criterion • When is a result more extreme than the observed one? The higher the Chi-square criterion, the more extreme is the result! 130 types of 30B5.3. Paired Tests If the null hypothesis was true: 19 discordant pairs were observed. If clinical diagnosis and Doppler diagnosis were equally sensitive, then we expect • 9.5 times clinical „-“ and Doppler „+“ and • 9.5 times clinical „+“ and Doppler „-“. Now we have to use these expected values to calculate the Chi-square criterion. Remember: Calculate for each cell (observed minus expected)squared, then divide them by expected and sum over the cells. Contrarily to the already known ordinary Chi-square test, only the two discordant cells are used for the McNemar test: (16 − 9.5)2 + ( 3 − 9.5)2 9.5 9.5 = 8.9 2 This formula can also be simplified to ( f − g ) f +g If you insert for f=16 and g=3 then the simplified formula also results in a value of 8.9. Of course, also an exact version of the McNemar test exists, which is especially recommended for small values of f and g. SPSS offers two menus to calculate the McNemar test. 1st option: We click on Analyze Nonparametric Tests 2 Related Samples... We click on the variables CLINICAL and DOPPLER and move them to the field Test Pairs: They appear as Pair 1 under Variable1 and Variable2, respectively. In the field Test Type we choose McNemar. Then we click on OK and the requested test will be calculated. SPSS gives automatically the exact p-value of 0.004 („Exact Sig. (2-tailed)”). 131 30B5.3. Paired Tests 2nd option: We click on Analyze Descriptive Statistics Crosstabs... Now, we move the variable CLINICAL to the field Row(s): and the variable DOPPLER to the field Column(s):. We click on the button Statistics... and choose McNemar. Then we click on Continue and OK and the requested test will be calculated. The exact p-value is again 0.004. Finally, only the interpretation of the test result is missing. The p-value of 0.004 is smaller than the commonly used significance level in medicine of 0.05. Thus the test result is statistically significant and we can reject the null hypothesis. Diagnosis of a vascular occlusion with Doppler is more sensitive than a diagnosis by clinical assessment. Summary All statistical tests that have been introduced in this and the last chapter are summarized in the following table: Outcome variable Compare two independent groups Compare two paired groups t-test paired t-test scale or ordinal Wilcoxon rank-sum test Wilcoxon signed-ranks test or sign test binary Chi-square test McNemar’s test scale and normally distributed 132 31B5.4. Confidence intervals 5.4. Confidence intervals Let’s return to example 4.1.1, the rat diet data set. Recall, we computed a p-value of 0.076. Comparing this p-value to the usual significance level of 5%, we conclude that the null hypothesis of equal means cannot be rejected. • • Should we be content with that conclusion? Or is there more utilizable information behind? The t-test uses the mean difference between two groups to detect deviations from the null hypothesis. This mean difference has a scientifically useful meaning on its own (in contrast, the rank sum computed by the Wilcoxon rank sum test has not). We could replace the scientific question “Are there differences between the underlying populations?” by “How large are these differences?” Clearly, if we observe a difference in mean weight gain of 19 grams between the high and low protein dietary groups this doesn’t mean that the “true” difference is exactly 19 grams. On the other hand, we expect the “true” or population difference to be close to 19 grams. It would be of great value if we could specify an interval within which the “true” difference is believed to fall. Such a specification can never be certain, because inference based on a few (a small sample) about all (the population) cannot offer 100% certainty (unless we restrict the conclusions to uninteresting statements like: “it is certain with 100% probability, that the absolute difference in weight gain of the rats will be less than 5000 kg”). Usually, we decide to make statements which can be assigned 95% certainty (sometimes also 90% or 99%). We are seeking an interval, which covers the “true” weight difference with high probability. Such an interval is referred to as confidence interval, and along with the interval, we also specify the degree of confidence which is assigned to that interval, e. g. a “95% confidence interval” or a “confidence interval with level 95%”. Confidence intervals are based on the following concept: assume we could repeat the study very often (collecting new data each time) and for each of the study repetitions we compute a 95% confidence interval for the weight difference. Then 95% of these intervals would cover the “true” difference. 133 31B5.4. Confidence intervals For example 4.1.1, confidence intervals have already been computed. They are output along with the t-test results. Independent Samples Test t-test for equality of means Std. Error 95% Confidence Interval Difference of the Difference Lower Upper Weight gain (day Equal variances 28 to 84) assumed 10.0453 -2.1937 40.1937 Equal variances not assumed 9.9440 -2.4691 40.4691 The 95% confidence interval for the difference in weight gain between high and low protein dietary groups amounts to -2.2 grams to 40.2 grams. Confidence intervals are very closely related to statistical tests. They could even replace the specification of p-values. If a confidence interval doesn’t cover the difference stated in the null hypothesis (usually 0), then the test result is statistically significant on the pre-specified significance level. A 95% confidence interval always corresponds to a test with significance level of 5%, a 99% confidence interval to a test at the 1% level etc. In our example, the 95% confidence interval covers the null hypothesis (=0 grams weight gain difference). Therefore, the test fails to be significant at the 5% level. Note: if equal variances in the two groups cannot be assumed, a modified version of the confidence interval has to be used. Make use of confidence intervals! 134 32B5.5. One-sided versus two-sided tests 5.5. One-sided versus two-sided tests Recall: The p-value is the probability of a result at least as extreme as the observed one assuming the null hypothesis applies. Clearly, if the null hypothesis is true extreme results occur in any direction with the same frequency. Concretely with the rat diet example: If there is no difference in weight gains between the two protein diets and we could repeat the experiment quite often, then sometimes the rats with low protein would gain more weight and sometimes the rats with high protein would gain more weight - distributed just by pure chance! We consider this case by performing two-sided tests and thus calculate two-sided p-values. This is what we have done up to now. In the majority of cases this will be the only correct approach. In rare cases, there can be research questions where scientific meaningful differences can only appear in one direction. Observing a difference in the opposite direction would always be regarded as due to chance – regardless of the size of this difference. A typical example are doseresponse relationships in toxicology – an increase of the dose can not lead to decreased toxicity. In such a case the alternative hypothesis would have to be restricted to one direction only. A potential alternative hypothesis for the rat diet example could be: "Food with low protein causes higher weight gain than food with high protein." Thus, we would perform a one-sided t-test and calculate a one-sided p-value. One-sided tests are rarely suitable. Even if we have the strong presumption that the new therapy could not be worse than the present therapy, we cannot be sure. If we were sure then we wouldn’t need to perform an experiment! 135 32B5.5. One-sided versus two-sided tests Whether a one-sided test is considered appropriate has to be decided before data analysis or even better, before data collection. For prospective studies, the intended test strategy (and the scientific reasons therefore) is usually recorded in the study protocol in order to avoid unpleasant discussions afterwards. The decision for a one-sided test should in no case depend on the result of the experiment or the study. In medical literature, one-sided p-values often fall between 0.025 and 0.05. This means: a two-sided test wouldn’t have been significant! One can assume in many of these cases that there were no records about the use of one-sided hypotheses in advance. The nominal significance level will not be preserved by a one-sided alternative hypothesis that depends on the result of the experiment. The actual significance level would be the double of the nominal one, namely 10%. • Strictly use only two-sided tests! • Use one-sided tests only if you have planned this at the beginning of the study and if you have stated scientifically justifiable reasons in the study protocol! • Be in principle suspicious about studies which use one-sided test results without reasonable justification! 136 33B5.5 Exercises 5.5 Exercises 5.5.1. 27 out of 40 16-year old boys and 12 out of 32 equally-aged girls underwent a questionnaire about their smoking behaviour. You can find the aggregated data in the data file b5_5_1.sav. Does the smoking behaviour depend on the sex of the teenagers? 5.5.2. Values for thyroxin in the serum (nmol/l) of 16 children with hypothyroidism can be found in the data file b5_5_2.sav, differentiated by strength of the symptoms („no to slight symptoms“ versus „pronounced symptoms“). No to slight symptoms: 34, 45, 49, 55, 58, 59, 60, 62, 86 Pronounced symptoms: 5, 8, 18, 24, 60, 84, 96 Compare the thyroxin values between both symptom groups. 5.5.3. You can find data from two patient groups in the data file b5_5_3.sav. The first group (Hodgkin=1) consists of Hodgkin patients in remission. The second group (Hodgkin=2) is a comparable adequate group of non-Hodgkin patients, also in remission. The numbers of T4 and T8 cells/mm3 blood were determined. (a) Are there differences in T4 cell counts between both groups? (b) Are there differences in T8 cell counts between both groups? 137 33B5.5 Exercises 5.5.4. Often, a logarithmic transformations are applied to right-skewed data. The reason for this is simple: If the logarithmic values in both groups are approximately normally distributed then a t-test can be applied to these data. What does this mean for the interpretation of the results on the original scale (without logarithmic transformation)? Guidance: Which mathematical operation has to be applied to return from the logarithmic scale to the original scale? What are the effects of this mathematical operation on a difference? What are the effects of this mathematical operation on the null hypothesis of the t-test? 5.5.5. Do your considerations of example 5.5.4. change your analysis of example 5.5.3? Guidance: Try a logartihmic transformation on the cell-counts of example 5.5.3. Is the t-test after the transformation reasonable? If yes, perform a t-test and interpret the results. If no, specify reasons for your decision. 5.5.6. Convert the data of the rat diet example 4.1.1 from SPSS to EXCEL. Perform t-tests with equal and unequal variances and a test for equality of variances. 5.5.7. Use the data from example 5.3.1. and generate the variable DIFF from the difference PREMENS minus POSTMENS. Click on Analyze Compare Means One-Sample T Test... and move the variable DIFF to the field Test Variable(s):. Then click on OK. (a) What attracts your attention compared to the paired t-test? Do you have an explanation for that? (b) Can you figure out what the field Test Value means? Why is there a zero? What happens if you change this value? 138 33B5.5 Exercises 5.5.8. The following study was performed with 173 patients with skin cancer. The skin reaction of each patient with the contact allergen Dinitrochlorobenzol (DNCB) and with the skin irritating and inflammation actuating Kroton oil was assessed. +ve ... skin reactions -ve .... no skin reactions The aim of this study was to answer the question if for patients with skin cancer the contact to DNCB causes different skin reaction than contact to Kroton oil. Here are the study results: skin reaction to DNCB Kroton oil frequency +ve +ve 81 +ve -ve 23 -ve +ve 48 -ve -ve 21 Perform the corresponding analyses. 5.5.9. Here are the complete data for the diagnosis study about vascular occlusion (example 5.3.2). The diagnosis tools clinical and Doppler were compared. The true diagnoses were assessed by venography. clinical Doppler + + - + + - number of patients with occlusion without occlusion sum 22 3 16 3 44 Compare the specificities of both diagnostic tools. 139 sum 4 27 5 41 77 33B5.5 Exercises 5.5.10. A matched case-control study was performed to evaluate if undescended testicle (maldescensus testis) at birth leads to increased risk for testicular cancer. Cases were 259 patients with testicular cancer. For each case a control patient from the same hospital was matched who was of about the same age (± 2 years), belonged to the same ethnic group and suffered on a disease different from testicular cancer. In the data file b5_5_10.sav you can find the corresponding data. Answer the research question by using these data. Additional remark: We have to be careful with the term „case“, as it has two different meanings: (i) A case in this case-control study is a patient with testicular cancer. (ii) A case in the data matrix is a pair, which consists of a patient with testicular cancer and the corresponding matched control. 5.5.11. Example 5.5.5 continued: compute a 95% confidence interval for the original scale (cells/mm3 blood). How can you interpret the confidence interval? 5.5.12. Use the data set of the rat dietary example 4.1.1. Compute a one-sided t-test corresponding to the alternative hypothesis “low protein diet leads to more weight gain than high protein diet” and specify a one-sided 95% confidence interval. Compute a one-sided t-test corresponding to the alternative hypothesis “high protein diet leads to more weight gain than low protein diet” and specify a one-sided 80% confidence interval. 5.5.13. Use the rat dietary example data set (b4_1_1.sav). Add a fixed value to all weight gains corresponding to group 1. Choose the value such that the t-test yields a p-value of exactly 0.05. Choose the value such that the t-test yields a p-value of exactly 0.01. Hint: if you cannot proceed just by thinking, try to solve the exercise by “trial-and-error”. Then, try to find distinctive features according to your solutions. By doing so, pay special attention to the mean difference between the groups and the corresponding confidence limits. 140 33B5.5 Exercises 5.5.14. A team of authors had submitted a paper to a medical journal. In the review of the manuscript, the referees criticized the statistical method that had been applied to analyze the data. They encouraged the authors to present the results by means of confidence intervals. The first author replied as follows: “In studies as ours, however, given the relatively small numbers, a confidence interval … is likely to contain zero, be fairly wide and include both positive and negative values. Therefore this is not an appropriate setting for this form of analysis as the result always will be too diffuse to be meaningful.” In future, will you accept such a rationale? Whether or not, give reasons for your decision. 5.5.15. Perform an independent sample t-test on the data of exercise 5.3.1 (female ingestion) and compare the results to those of the paired sample t-test. 5.5.16. The following table contains the respiratory syncytial virus (RS) antibody titers measured in nine different patients. Patient number Titer 010174 8 020284 16 019459 16 011301 32 000232 32 024336 32 015319 64 009803 64 007766 256 (a) Compute the geometric mean titer (GMT). (b) Compute a 95% confidence interval for the GMT. (c) How would you describe the computations of (a) and (b) to be suitable for the statistical methods section of a scientific publication? 141 34BOpening and importing data files Appendices A. Opening and importing data files Opening SPSS data files (*.sav) SPSS data files can directly be opened using the menu File-OpenData... Not only SPSS data files can be opened using this menu, as the following list illustrates: The most important are the SPSS (sav), Excel (xls), SAS Long File Name (sas7bdat), and Test (txt) file types. Opening Excel data files Be careful with data bases created by Excel. As mentioned above, Excel imposes no restrictions on the type of data the user enters in a table. Frequent errors when importing Excel files to SPSS are: data columns are empty, date formats cannot be erased, or numeric columns are imported as string types, etc. In order to avoid such problems, some rules should be obeyed: • • • • • • Clean the data in Excel before importing it into SPSS: the first row should contain variable names, all other rows should only contain numbers! Pay attention for a consistent use of the comma. Be careful with different language versions. Erase any units of measurements in data columns. Pay attention to hidden spaces in data columns; they should be erased by replacing them with “nothing”. Don’t use tags in data columns, like question marks for imprecisely collected data (see above). Sometimes formats once assigned to data columns cannot be erased. This problem can be solved by copying the complete data area and pasting the “contents” (using “edit-paste contents”) into a new sheet. 142 34BOpening and importing data files • Wrong date formats in a column can be erased by selecting the column and choosing General from the menu Format-cells: 143 34BOpening and importing data files First, the Excel data file should be opened in Excel: The first row should only contain SPSS compatible variable names. Data values should follow from the second row. Column headers are changed accordingly (if not, SPSS will generate variable names automatically, and will use the column headers as variable labels): 144 34BOpening and importing data files Now the first row is erased and the file is saved using a different name like “bloodpress2”. Then the data file is opened using SPSS, selecting File-Open-Data.... Change Files of type to Excel (*.xls, *.xlsx, *.xlsm) and press Open: 145 34BOpening and importing data files Now a window pops up in which the working sheet of the Excel file and the data area can be specified. Choose Read variable names from first row of data such that variable names are adopted from the Excel file. Otherwise, the first row would be treated as a row containing data values. The data is imported into SPSS, and should be saved as SPSS data file immediately. Importing data from text files A further possibility to import data from other programs is provided by the text import facility. Text files can easily be created by virtually any program (Word, Outlook, etc.). In Outlook, save an e-mail using FileSave As and choose Text files (*.txt) at the Save as type: field. 146 34BOpening and importing data files 147 34BOpening and importing data files By double-clicking the text file (data.txt) in the explorer, the notepad opens. Delete any rows that do not contain data values or variable names: Save the file as Data2.txt. In SPSS, select File-Open-Data and choose Text (*.txt, *.dat) from Files of type:. Select Data2.txt and press Open. The so-called Text Import Wizard opens and guides through the data import: 148 34BOpening and importing data files At step 2, change Are variable names included at the top of your file? to Yes. 149 34BOpening and importing data files 150 34BOpening and importing data files In our case, columns are separated by the Tab. Steps 5 and 6 require no modification from the user. Finally, the data file is imported and appears in the SPSS Data Editor window. Opening multiple data sets From version 14, SPSS is able to handle multiple data sets in multiple windows of the SPSS Data Editor. These are distinguished by subsequent numbers ([DataSet1], [DataSet2], …). The active window contains the socalled active data set, i. e., the data set to which all subsequent operations apply. While a dialogue is open, one cannot change to another data set. During an SPSS session, at least one SPSS Data Editor window must be open. When closing the last Data Editor window, SPSS is shut down and the user is asked to save changes. Copy and paste data One may also import data by copying data areas in Excel or Word and pasting them into SPSS. However, keep in mind that • In the data view, only data values can be pasted 151 34BOpening and importing data files • Numeric data can only be pasted into numeric variables, character strings can only be pasted into variables defined as string type Copy&Paste is very error-prone, it should only be used for very small amounts of data that can easily be checked visually, and only if there is no other possibility to import data. Data may also be copied between SPSS Data Editor windows. Note that when copying a range of cells in a data view table, only data values will be pasted into another data view table. Variable descriptions will only be pasted if a complete column is copied (including the column header). Data from MS Access Data bases from MS Access can be opened by SPSS using the Database Wizard (File-Open database-New query). Database queries require a working installation of the ODBC drivers, which come with a complete Microsoft Office installation. 152 34BOpening and importing data files Exercises A.1 Reading Source: Timischl (2000). Eight probands completed a training to increase their reading ability. Reading ability was measured as words per 20 minutes before start and after completion of the course. The data was entered into the Excel table “reading.xls”. Import the data into SPSS. Define a score that reflects the increase in reading ability and compute the score for each patient (“Transform-Compute Variable...”). Save the data set as “reading.sav”. A.2 Digoxin The Excel table “digoxin.xls” contains dixogin readings for 10 patients, measured on six consecutive days. Import the table into SPSS. Pay attention to the correct definition of the variables. Save the data file as “digoxin.sav”. A.3 Alanine Aminotransferase The Word document “Alanine Aminotransferase.doc” contains the frequency distribution of ALT (Alanine Aminotransferase) in 240 medical students. Import the table into SPSS. Use the text import wizard to properly import the table! Weight the rows by the frequency variable. Save the data file as ALT.sav. A.4 Down’s syndrome Source: Andrews and Herzberg (1985). The text file “down.txt” contains the frequency of all livebirths from 1942 to 1957 in Victoria, Australia, and the number of newborn children with Down’s syndrome. The data are sorted by age group of the mothers. Import the data set into SPSS and save it as “down.sav”. 153 35BData management with SPSS B. Data management with SPSS Merging data tables If a data base is saved in separate tables, then these tables can be merged to one table. Tables can be merged • • one below the other, or side by side Example B.1: multi-center trial. Clinical trials are often recruiting patients in multiple hospitals or “centers”. Data can be entered into the computer system locally, and merged only at time of analysis. Thus, every center supplies a data file with different patients and these data files have to be merged one below the other. Example B.2: longitudinal clinical study. Clinical trials often involve repeated measurements on patient characteristics (e. g., blood pressure before treatment and every month after start of treatment). On the other hand, some variables are collected only once (e. g., age at start of treatment, sex, etc.). Therefore, it is most efficient to create separate tables for the baseline and the repeated data. At time of analysis, the tables are merged using a unique patient identifier (patient identification number). Merging tables one below the other Consider the data file “bloodpress.sav” and an additional data file containing the data of five other patients, “appendix.sav”. These two data files should be merged. First, open “bloodpress.sav” and “appendix.sav” with the SPSS Data Editor. Now two windows of the SPSS Data Editor should be open. Now go to the window containg “bloodpress.sav” and choose Data-Merge Files-Add Cases from the menu. Select appendix.sav and press Continue. 154 35BData management with SPSS As the variables do not have the same names in both data files, they must be paired using the following sub-menu: Variables corresponding to the active data set are marked by “(*)” and those corresponding to appendix.sav by “(+)”. Now select two matching variables from either data set and press Pair. Note that the second variable can be selected by holding the “Strg” (“Ctrl” on English keyboards) key while clicking on the variable name. 155 35BData management with SPSS Repeat this procedure for all pairs of variables until all variables appear at Variables in New Active Dataset: Press OK. Note that the asterisk (*) before the data set name in the active SPSS Data Editor window indicates that changes have not yet been saved. Merging tables side by side Now assume that the ages of the patients are saved in a separate data file (“age.sav”). First, open “age.sav” using the SPSS Data Editor. This data file contains two variables: proband identification number and age. When this data set should be merged with our active data set, it is important that the proband identification numbers agree. Choose Data-Merge Files-Add Variables and select age.sav. 156 35BData management with SPSS In order to ensure that the data files are merged correctly, we should define “proband” as key variable and match the two files using this key variable. Select proband from the field Excluded Variables:, choose Match cases on key variables in sorted files and Non-active dataset is keyed table and put proband into the field Key Variables by pressing the corresponding arrow button. Press OK, the following warning appears on the screen: This warning reminds us that both data files must be sorted by the key variable before merging. The data sets are already sorted in our example. Please not that in general, data sets should be sorted (Data-Sort Cases) before merging! After pressing OK the merged data set appears in the active Data Editor window: 157 35BData management with SPSS If some patients are completely missing in one of the data files, then choose Both files provide cases. Otherwise, only patients present in both files might appear in the merged data set. Computing new variables New variables can be computed using the menu Transform-Compute Variable. Computations are carried out row-wise. Assume that we wish to compute the change in blood pressure measured in sitting probands between the time points “before treatment” and “after treatment”. This can be done by computing the difference of the corresponding measurements. Choose Transform-Compute Variable from the menu: 158 35BData management with SPSS A new variable is created (“sit_31”) in which the difference between posttreatment measurement and pre-treatment measurement is computed for each proband. Negative values indicate a decline in blood pressure. Note: Spreadsheet programs (Excel) often treat empty cells as 0. Computations on empty cells can have surprising results! In SPSS, empty cells are treated as missing data, and computations involving missing data usually lead to empty cells in the outcome variable. The values computed by Transform-Compute Variable are treated as observations on further variables, which are statically linked to the corresponding row (observation). SPSS does not memorize the expression leading to the computed data values, it can only save the expression in the variable label of the new variable. If some values of sit_3 or sit_1 are changed after the computation of sit_31, then the computation has to be repeated in order to update sit_31. This is a major difference between statistics programs that usually assume fixed data values to spreadsheet programs, in which computed values are dynamic relationships. Creating a variable containing subsequent numbers 159 35BData management with SPSS Observations can be numbered consecutively by the Special Variable “$Casenum”. This variable is generated by SPSS automatically, but usually not visible to the user. $Casenum contains, for each observation, the current row numbers. $Casenum can be used to automatically generate patient identification numbers in the following way: Choose Transform-Compute Variable and enter PatID as target variable, and $Casenum (which can also be found selecting Miscellaneous in the field Function group. Press OK. Please note that the patient number “patid” will only be generated for existing observations. If observations are added later, then the procedure must be repeated. Again, the data values of patid are statically linked to the observations. If you want to use $Casenum to generate a patient identifier in an empty sheet, then you should first enter an arbitrary value in the last row needed, such that the patient identifier is generated for the correct number of observations. Functions 160 35BData management with SPSS The menu Transform-Compute Variable can be used to perform mathematical operations of any kind. Basic arithmetical operations are plus (+), minus (-), multiplied by (*) and divided by (/). The exponent (“to the power of”) is symbolized by **. Boolean (logical) operations include greater than (>), less than (<), equal to (=), not equal to (~=), and the logical AND (&), OR (|) and NOT (~). Furthermore, various other functions are provided. The definitions of the functions can directly accessed by selecting a function from a function group. Recoding variables The menu Transform-Recode helps in recoding variables. The most common application of that menu is to define codes for particular value ranges (domains). Assume we want to categorize the blood pressure into “normal” (up to 150), “high” (151-200) and “very high” (>200). Choose Transform-Recode-Into Different Variables from the menu: First put all blood pressure variables into the field Numeric Variable -> Output Variable:. Then press Old and New Values. Here, domains can be assigned codes. 161 35BData management with SPSS The first code (“New Value”), 1, will be assigned to the domain 150 or less. Insert 150 into the field labeled “Range, LOWEST through value:”. Press “Add” and define the other two codes, 2 for the domain 151 through 200 (the fourth option on the left hand side), and 3 for the domain 201 through HIGHEST (the sixth option, insert 201). The definition of each domain/code assignment must be confirmed by pressing Add. Finally, we arrive at: 162 35BData management with SPSS Then press Continue. Now we have to define names for the six output variables containing the codes. Select the first input variable from the field “Numeric Variable -> Output Variable” and create a name for the corresponding output variable in the field “Output Variable Name:”. Press Change and proceed to the second input variable etc. When all output variables have been named, press OK. Six new variables containing domain codes have been created in the Data Editor window. Now we can assign value labels to the codes using the Variable View of the Data Editor. Switch to “Variable View” (lower left corner) and select variable “sit_1c”. Then click on None in the column “Values”. Define the value labels for the three possible codes that the variable may assume: 163 35BData management with SPSS Click OK. The definition of the value labels can be copied and pasted into the corresponding field of the other variables: 164 35BData management with SPSS Exercises B.1 Hemoglobin Source: Campbell and Machin (1993). A data set of a study investigating the relationship of hemoglobin (hb), hematocrit (packed cell volume, PCV), age and menopausal status (menopause) is separated into three tables: Hemoglobin1.xls contains hb, pcv and meno from all premenopausal patients Hemoglobin2.xls contains the same data from all postmenopausal patients Hemoglobin-age.xls contains the age of all patients. Import all tables into SPSS and merge them into one SPSS data file. Pay attention to sorting when adding the age table; the patient number should be used as key variable! Save the file as “hemoglobin.sav”. B.2 Body-mass-index and waist-hip-ratio Two tables, controls.xls and patients.xls contain data on age, weight, height, abdominal girth (waist) and hip measurement (hip) as well as sex for healthy control individuals and patients suffering from diabetes. In both tables, sex is coded as 1 (male) and 2 (female). Merge the two data tables into one SPSS data file. Be aware of the different column headers in the tables. Compute body mass index (bmi = weight (kg)/height2 (m2)) and waist-hip-ratio (whr = waist/hip). Save the data set as “bmi.sav”. B.3 Malic acid Source: Timischl (2000). Two Excel tables contain the malic acid concentrations of samples of ten different commercially available apple juices. The same samples have been subjected to either enzymatical (“encymatical.xls”) or chromatographical (“chromatographical.xls”) measurement of the malic acid concentration. Merge both data tables. For each product, compute the difference of both measurement techniques (using the menu “Transform-Compute”). Save the data set as “malicacid.sav”. 165 35BData management with SPSS B.4 Cardiac fibrillation Source: Reisinger et al (1998). One-hundred and six patients suffering from ventricular fibrillation were treated using two different treatments. Patient data (V6=potassium, V7=magnesium, treat=treatment group, fibr_dur=duration of ventricular fibrillation in days) is contained in the Excel file “baseline.xls”. Sex is coded as 1 (male) and 2 (female). In a second Excel table (“results.xls”) you can find the variable V33 which is coded as 1 (successful treatment of fibrillation), and 0 (no success after 120 minutes of treatment). Merge the two tables into one SPSS data file. Compute the body mass index (bmi = weight (kg)/height2 (m2)). Save the SPSS data file as “fibrillation.sav”. 166 36BRestructuring a longitdudinal data set with SPSS C. Restructuring a longitdudinal data set with SPSS A longitudinal data set, i. e., a data set involving repeated measurements on the same subjects, can be represented in two formats: • • The ‘long’ format: each row of data corresponds to one time point at which measurements are taken. Each subject is represented by multiple rows. The ‘wide’ format: each row of data corresponds to one subject. Each of several serial measurements is represented by multiple columns. The following screenshots show the cervical pain data set in long … … and wide format: 167 36BRestructuring a longitdudinal data set with SPSS With SPSS, SAS and other statistics programs, it is possible to switch between these two formats. We exemplify the format switching on the cervical pain data set. Switching from wide to long We start with the data set cervpain-wide.sav as depicted above. From the menu, select Data-Restructure: 168 36BRestructuring a longitdudinal data set with SPSS Choose ‘Restructure selected variables into cases’, press ‘Next >’: 169 36BRestructuring a longitdudinal data set with SPSS Now the dialogue asks us whether we want to restructure one or several variable groups. In our case, we have only one variable with repeated measurements, but in general, one will have more than one such variable. Now we have to define the subject identifier. This is done by changing ‘Case group identification’ to ‘Use selected variable’, and moving ‘Patient ID’ into the field ‘Variable’: Then we have to define the columns which correspond to the repeatedly measured variable. We change ‘Target Variable’ to ‘VAS’ (directly writing into the field), and move the 6 variables labeled ‘Pain VAS week 1’ to ‘Pain VAS week 6’ into the field ‘Variables to be Transposed’: 170 36BRestructuring a longitdudinal data set with SPSS Please take care of the correct sequence of these variables. All other variables, which constitute the baseline characteristics, are moved to the field ‘Fixed Variables’: 171 36BRestructuring a longitdudinal data set with SPSS Then press ‘Next >’. The program now asks us to define an index variable. This variable is later used to define the time points of the serial measurements. Therefore, we could name it ‘week’. We only need one index variable: 172 36BRestructuring a longitdudinal data set with SPSS We request sequential numbers, and change the name to ‘week’: In the options dialogue that appears subsequently, we request to keep any variables that were not selected before as ‘fixed’ variables, and also to keep rows with missing entries. 173 36BRestructuring a longitdudinal data set with SPSS In the next dialogue, we are asked if we want to create SPSS syntax which does the restructuring of the data for later reference (SPSS syntax can be used to perform ‘automatic’ analyses, or to keep track of what we have done): After pressing ‘Finish’, the data set is immediately restructured into long format. You should save the data set now using a different name. Switching from long to wide We start with the data set in long format (cervpain-long.sav). Select DataRestructure from the menu, and choose ‘Restructure selected cases into variables’: 174 36BRestructuring a longitdudinal data set with SPSS Define ‘Patient ID’ as ‘Identifier Variable’ and ‘Week’ as ‘Index variable’: 175 36BRestructuring a longitdudinal data set with SPSS Next, we are asked if the data should be sorted (always select ‘yes’): The order of the new variable groups is only relevant, if more than one variable is serially measured. In our case, we have only the VAS scores as repeated variable. Optionally, one may also create a column which counts the number of observations that were combined into one row for each subject. 176 36BRestructuring a longitdudinal data set with SPSS Pressing ‘finish’, we obtain the data set in wide format. The VAS scores are automatically named ‘VAS.1’ to ‘VAS.6’. 177 37BMeasuring agreement D. Measuring agreement Agreement of scale variables An example (Bland and Altman, 1986): two devices measuring the peak expiratory flow rate (PEFR) are to be compared for their agreement. The devices are called „Wright peak flow meter” and “mini Wright meter”. The PEFR of 17 test persons was measured twice by each device: “wright1” denotes the first measurement on the Wright device, “wright2” the second measurement, “mini1” the first measurement on the Mini device, “mini2” the second measurement. First we restrict our agreement analysis to the first measurements of the Wright and Mini devices, and we generate a scatter plot of “wright1” and “mini1” to depict the association of the measurements: 178 37BMeasuring agreement The Pearson correlation coefficient for this data is r = 0.94. • • Can we conclude that both devices agree almost perfectly? What if the values measured by the Mini device were twice as high as those measured by the Wright device? The Pearson correlation coefficient is still r = 0.94. The Mini device now yields twice the value as before, but neither the scatter plot nor the correlation coefficient is sensitive to such a transformation. Therefore, these are no adequate tools to describe agreement of measurements. 179 37BMeasuring agreement Instead, Bland and Altman (1986) suggested to analyze the agreement of measurements by the following procedure: • • • For each subject, compute the difference of the two measurements and their mean Describe the distribution of the differences by mean and standard deviation or nonparametric measures (depending on their distribution) Generate a scatter diagram of the subject differences versus the subject means to see if the magnitude of the differences depends on the magnitude of the measurements. The subject means act as approximation to the true values which are to be measured by the devices. In our example, we have 17 subject differences. If we compute a mean of these 17 values, we obtain a measure of the mean deviation of the two methods. Computing a standard deviation of the 17 subject differences, we obtain a measure of the variation of the agreement. Even if the original measurements are not normally distributed, the distribution of the subject differences often is much closer to normal, justifying its description by mean and standard deviation. From the chapter on statistical measures we know that, assuming approximately normal distributions, the range of mean ± 2 standard deviations covers about 95% of the data. Thus, 95% limits of agreement can easily be computed. In our example we have the following mean and standard deviation: Subject difference Count 17 Mean 2.12 Std Deviation 38.77 The mean difference is 2.12 l/min, the 95% limits of agreement are 2.12 – 2 × 38.77 = -75.5 l/min and 2.12 + 2 × 38.77 = 79.7 l/min. Thus, the devices may deviate in a range of -75.5 to +79.7 l/min. We cannot speak of “fair agreement” in this example! The subject differences can be computed by choosing Transform-Compute Variable from the menu and fill in the fields as follows (don’t forget to define a label by clicking on Type & Label...: 180 37BMeasuring agreement Mean and standard deviation can be computed using the menu item Analyze-Tables-Custom Tables. A box plot or histogram of the subject differences can be generated using the instructions of chapter 2. Such a diagram helps us in deciding whether we can consider the subject differences as being approximately normally distributed or not. Similarly to the subject differences, the subject means are computed: 181 37BMeasuring agreement Bland and Altman (1986) also suggested to draw a so-called m-d-plot to describe the agreement of two measurement methods, i. e. a diagram plotting subject differences (d) against subject means (m). Additionally, such a plot should show the mean deviation (mean subject difference) and the limits of agreement: 182 37BMeasuring agreement This diagrams shows • • • the variation in agreement, a systematic bias (if one method tends to measure higher values than the other), and a potential dependence of the variation in agreement on the magnitude of the measured values In the latter case a transformation (e. g., a log transformation) of the measurements is indicated. The Bland-Altman-Plot can be generated using SPSS in the following way: 1. Compute subject differences and subject means as described above. Don’t forget do define labels for these new variables. 2. Generate a scatter plot of subject differences against subject means (Graphs-Chart Builder...): 183 37BMeasuring agreement 3. Now the mean subject difference, and the lower and upper limits of agreement can be inserted as so-called reference lines: a. Double-click the graph b. Click Options - Y Axis Reference Line. c. Enter the value -75.42 in the Position field, click on Apply and Close. d. Repeat b and c with the values 2.12 and 79.66: 184 37BMeasuring agreement M-d-plots can always be used to describe the distance between two distributions. Agreement of nominal or ordinal variables Example: efficacy of a treatment (data file “efficacy.sav”). 105 patients and their physicians were asked about the efficacy of a particular treatment. Three response categories were allowed: “very good”, “good”, “poor”. The following responses were obtained: Patient’s rating Very good Good Poor Very good Physician’s rating Good Poor 36 6 5 10 16 8 4 8 12 As a measure of agreement, we could compute the percentage of matching responses (in our case 64 out of 105 = 61%). This measure has the following disadvantage: assume, physicians’ and patients’ ratings were completely random. In this case we would still obtain a certain degree of agreement, as physicians’ and patients’ ratings will match in some cases 185 37BMeasuring agreement just randomly. In our example, the expected percentage of matching responses assuming random rating is 36%. This value can be computed similarly to the expected cell frequencies in a chi-square-test. The measure of agreement called kappa (greek letter κ) improves on the simple percentage of matching ratings by relating it to the expected percentage assuming random rating. It is defined by the ratio of the observed proportion of matching ratings (denoted by A) to one minus the expected proportion assuming random rating (1 – C). In other words: κ = (A – C)/(1 – C), with A and C denoting the observed proportion of matching ratings and the expected proportion of matching ratings assuming random rating. The κ measure can be computed by choosing the menu item AnalyzeDescriptive Statistics-Crosstabs, clicking on Statistics and selecting Kappa: Symmetric Measures Measure of Agreement N of Valid Cases Kappa Value Asymp. Std. Error(a) Approx. T(b) Approx. Sig. .390 .071 5.564 .000 105 a Not assuming the null hypothesis. b Using the asymptotic standard error assuming the null hypothesis. Under the null hypothesis of random rating, we can expect 36% matching ratings. We have observed 61% matching ratings. Kappa is thus computed as (0.61-0.36)/(1-0.36)=0.39. The significant p-values (<0.001) indicates that the assumption of random rating is not plausible and must be rejected. Note: Some programs offer the possibility to compute a weighted version of the Kappa measure, which assigns more weight to more serious mismatches (e. g., “very good” and “poor” are rated twice as much as “very good” and “good”). The weighted Kappa measure is not implemented in SPSS. 186 37BMeasuring agreement Exercises C.1 Source: Little et al (2002). Use the data file “Whitecoat.sav”, which contains two measurements of blood pressure for each of 176 patients. While the first measurement was taken in an outpatient department by a nurse, the second one was taken by a physician in a primary care unit (“white coat”). Is blood pressure higher if measured by a physician? Analyze the agreement of the two measurements. C.2 Source: Schwarz et al (2003). Perfusion was measured in fifty-nine dialysis patients by two different ultrasound methods (ultrasound dilution technique, UDT; color Doppler ultrasonography, CDUS). Data are collected in the data file “Stenosis.sav”. Evaluate the agreement of the two methods in a proper way. C.3 Source: Bakker et al (1999). Kidney volume was measured in 20 persons by two methods: ultrasound and magnetic resonance. (data file “Renalvolume.sav”). Both kidneys of each test person have been evaluated. Evaluate the agreement of the two methods in a proper way. C.4 Use the data file “CT_US.sav”, which contains measurements of tumor size taken by computer tomography (ct_mm), by ultrasound (us_mm) and measured histologically (hist_mm). Which measurements are closer to the histologically measured values, those by CT or those by US? C.5 Source: Fisher and van Belle (1993). Coronary arteriography is a key diagnostic procedure to detect narrowing or stenosis in coronary arteries. In the coronary artery surgery study (CASS) the quality of the arteriography was monitored by comparing patient’s clinical site readings with readings taken by a quality control site (which did only evaluate the angiographic films and did not see the patients). From these readings the amount of disease can be classified as “none” (entirely normal), “zero-vessel disease but some disease”, and one-, two- and three-vessel disease: 187 37BMeasuring agreement quality control site reading * clinical site reading Crosstabulation Count clinical site reading quality control site reading Total normal normal 13 some 6 One 1 two 0 three 0 20 some 8 43 9 2 0 62 one 1 19 155 18 11 204 Total two 0 4 54 162 27 247 three 0 5 24 68 240 337 22 77 243 250 278 870 Before opening the data file “CASS.sav”, reflect how to enter such data. Evaluate the agreement between clinical site readings and quality control site readings! 188 38BReference values E. Reference values Reference ranges or tolerance ranges are used in clinical medicine to judge whether a particular measurement taken from a patient (e. g., of a lab investigation) indicates a potential pathology because of its extreme value. When computing reference ranges, we are looking for limits which define a range of, say, 95% of the non-pathologic values. With symmetric limits, we can expect about 2.5% of patients to show a value above the upper limit and 2.5% below the lower limit. Reference ranges can also be computed as a function of time, e. g. growth curves for children. Reference ranges can be computed by parametric (assuming a normal distribution) or non-parametric statistics. Parametric reference ranges are simply computed by making use of mean and standard deviation (e. g., mean +/- 1.96 standard deviations defines a 95% reference range). However, parametric reference ranges should only be used if the data agree well with a normal distribution. Sometimes this can be achieved by transforming the original values using a log transformation: Y = log(X + C) with X and C denoting the original values and a suitable constant, and Y denoting the transformed values. C can be chosen such that the distribution of Y is as close to a normal distribution as possible, but it is restricted to values C > -min X to avoid taking logs of negative values, which are not defined. The 95% normal range [AY, BY] can be computed using mean and standard deviation of Y. It can simply be transformed to the original scale using the equations AX = exp(AY) – C and BX = exp(BY) – C Even with only small deviations from a normal distribution non-parametric measures (e. g., 2.5th and 97.5th percentiles) should be preferred. There are also hybrid methods, e. g. the method by Harrell and Davis (1982). As a minimum sample size to compute 95% reference ranges 100 and 200 have been proposed, depending on whether parametric or non-parametric methods can be applied (cf. Altman, 1991). In samples of such size it should be easy to judge whether normal distribution can be assumed or not, e. g. by a histogram (see chapter 2) or by a Q-Q-plot which is described below. 189 38BReference values Q-Q-Plot The normal assumption can be verified in two ways: • • formally by applying a statistical test (a test of normality) which assess significant deviations from the null hypothesis of normal distribution visually using diagrams The former of these alternatives offers the convenience of an automatic judgment, but it also has some drawbacks: with large samples (N>100) even irrelevant deviations from a normal distribution can lead to a significant test of normality (i. e., rejecting the normal assumption). On the other hand, with small samples (N<50) tests of normality may suffer from poor statistical power. In such samples departures from normal distributions have to be large enough to be detectable and to result in a significant test of normality. The Q-Q plot is a visual tool to answer the question whether a scale variable follows a normal distribution. It is more sensitive than the simple comparison of dot plot and error bar plot presented in chapter 2. The Q-Q plot contrasts original values, eventually after performing a logtransformation, to the quantiles of an ideal normal distribution. If the original values are normally distributed, then all dots in the plot lie on a straight line extending from the lower left to the upper right corner of the plot. The Q-Q plots are generated by choosing the menu item Graphs-Q-Q... (exemplified on the data set “Dallal.sav”): 190 38BReference values Detrended Normal Q-Q Plot of pretest Normal Q-Q Plot of pretest 10 250 5 Deviation from Normal Expected Normal Value 200 150 100 0 -5 50 -10 0 0 50 100 150 200 0 250 50 100 150 200 Observed Value Observed Value SPSS generates two different diagrams. The first one shows the Q-Q plot described above. The second one, called detrended normal Q-Q plot, visualized departures from normality in a fashion similar to the BlandAltman plot described earlier, i. e. differences between the observed values and the expected quantiles assuming a normal distribution are plotted against the observed values in order to show where the data are close to normal and where they are not. We see that the empirical distribution of the data departs from the normal particularly at its edges. The detrended Q-Q plot often pretends serious departures from the normal distributions, this is caused by the range of the y axis which is adapted to the largest deviation. Therefore, it should always be evaluated in conjunction with the normal Q-Q plot and it should not show any systematic departures from the straight line, as exemplified below: Detrended Normal Q-Q Plot of Marker1 400 100 300 75 Deviation from Normal Expected Normal Value Normal Q-Q Plot of Marker1 200 100 0 50 25 0 -100 -100 0 100 200 300 400 0 500 100 200 300 400 500 Observed Value Observed Value The C-shaped and U-shaped impressions of the Q-Q and detrended Q-Q plots, respectively, indicate that a log transformation might help in transforming the observed distribution into normal. 191 38BReference values If a log transformation is used, one may try the transformation Y = log(X + C) inserting various values for the constant C, and choose that one that yields the best approximation to a normal distribution. The visual impression of the Q-Q plot can also be quantified by a statistical measure. With a perfect normal distribution, all dots lie on a straight line, i. e. the Pearson correlation coefficient assumes the value 1. So we can compute the correlation coefficient from a Q-Q plot and use it as and indicator of a normal distribution if it is close enough to 1. Clearly, it will never be exactly equal to 1 in finite samples, but a value of 0.98 or higher can be assumed as satisfactory. Using the correlation coefficient, we can also judge which value of C is optimal inn transforming the empirical distribution into normal. As an example, consider the data set “Dallal.sav”. Assume we would like to evaluate the normal assumption on the variable “pretest” using a Q-Q plot. In order to compute the correlation coefficient from the Q-Q plot we must have access to the values that are depicted by that plot. Thus, we have to compute the theoretical normal distribution quantiles by our own. This can be done using the menu item Transform-Rank cases, selecting pretest as variable: After clicking on Rank Types... we select Normal scores: 192 38BReference values Afterwards, we compute the Pearson correlation coefficient (menu item Analyze-Correlation-Bivariate) of the original variable pretest and the normal scores. It is fairly high, assuming a value of 0.9986. 193 38BReference values The following table can be used to assess values of the Q-Q plot correlation coefficient for various sample sizes. This table was generated by simulating normally distributed samples of various sizes. If data come from a normally distributed population, then the stated values of the Q-Q correlation will be exceeded with a probability of 99%. With a sample size N > 100, which is suggested to compute parametric reference ranges, the Q-Q correlation coefficient should exceed 0.983. N 100 200 300 400 500 1000 Q-Q correlation coefficient >0.983 >0.992 >0.994 >0.995 >0.996 >0.998 Exercise D.1 Normal range. The data file “ALT.sav” contains measurements of the parameter ALT on a sample of 240 representative test persons. The data are already categorized, therefore they should be weighted by the frequency variable. Compute a parametric 95% reference range for ALT using an appropriate transformation! Try various values of C to yield a distribution as close as possible to a normal distribution. 194 39BSPSS-Syntax F. SPSS-Syntax SPSS Syntax Editor If similar analyses have to be repeated several times, it can be cumbersome to repeat choosing menu items using the mouse. Therefore, SPSS offers the possibility to perform analyses using the SPSS syntax language. This language is easy to learn because SPSS translates each command that is called by selecting particular menu items into SPSS syntax. The translations can be made visible by clicking on Paste in a menu: After clicking on Paste the corresponding SPSS syntax is pasted into the SPSS syntax editor window. Each SPSS syntax command begins with a key word (e. g., COMPUTE) and ends with a full stop. 195 39BSPSS-Syntax To run a syntax (several commands), click on Run. Now you can choose among: • • • • All: executes all commands in the syntax editor window. Selection: executes all (partly) selected commands. Current: executes the commands where the cursor is currently located To End: executes all commands from the current cursor location to the end. The contents of a syntax editor window can be saved to be recalled later, e. g. with other data sets, or to reanalyze a data set during a revision. If the same analysis should be rerun using various variables, one can save a lot of time using SPSS syntax. Assume we want to generate box plots for all scale variables in the data set “bmi.sav” (height, weight, abdominal girth, hip measurement, BMI), grouped by sex and age group. We just call the menu defining a box plot: 196 39BSPSS-Syntax Instead of clicking on OK, we click Paste. In the syntax window the command which generates a grouped box plot is displayed: Now this command can be selected, copied and pasted several times below. Each time we replace “weight” by the names of the other scale variables: 197 39BSPSS-Syntax GGRAPH /GRAPHDATASET NAME="graphdataset" VARIABLES=sex weight age_group[LEVEL=NOMINAL] MISSING=LISTWISE REPORTMISSING=NO /GRAPHSPEC SOURCE=INLINE. BEGIN GPL SOURCE: s=userSource(id("graphdataset")) DATA: sex=col(source(s), name("sex"), unit.category()) DATA: weight=col(source(s), name("weight")) DATA: age_group=col(source(s), name("age_group"), unit.category()) DATA: id=col(source(s), name("$CASENUM"), unit.category()) COORD: rect(dim(1,2), cluster(3,0)) GUIDE: axis(dim(3), label("Sex")) GUIDE: axis(dim(2), label("Weight")) GUIDE: legend(aesthetic(aesthetic.color), label("age_group")) SCALE: cat(dim(3), include("1", "2")) SCALE: linear(dim(2), include(0)) ELEMENT: schema(position(bin.quantile.letter(age_group*weight*sex)), color(age_group), label(id)) END GPL. GGRAPH /GRAPHDATASET NAME="graphdataset" VARIABLES=sex height age_group[LEVEL=NOMINAL] MISSING=LISTWISE REPORTMISSING=NO /GRAPHSPEC SOURCE=INLINE. BEGIN GPL SOURCE: s=userSource(id("graphdataset")) DATA: sex=col(source(s), name("sex"), unit.category()) DATA: weight=col(source(s), name("weight")) DATA: age_group=col(source(s), name("age_group"), unit.category()) DATA: id=col(source(s), name("$CASENUM"), unit.category()) COORD: rect(dim(1,2), cluster(3,0)) GUIDE: axis(dim(3), label("Sex")) GUIDE: axis(dim(2), label("Weight")) GUIDE: legend(aesthetic(aesthetic.color), label("age_group")) SCALE: cat(dim(3), include("1", "2")) SCALE: linear(dim(2), include(0)) ELEMENT: schema(position(bin.quantile.letter(age_group*weight*sex)), color(age_group), label(id)) END GPL. GGRAPH /GRAPHDATASET NAME="graphdataset" VARIABLES=sex waist age_group[LEVEL=NOMINAL] MISSING=LISTWISE REPORTMISSING=NO /GRAPHSPEC SOURCE=INLINE. BEGIN GPL SOURCE: s=userSource(id("graphdataset")) DATA: sex=col(source(s), name("sex"), unit.category()) DATA: waist=col(source(s), name("waist")) DATA: age_group=col(source(s), name("age_group"), unit.category()) DATA: id=col(source(s), name("$CASENUM"), unit.category()) COORD: rect(dim(1,2), cluster(3,0)) GUIDE: axis(dim(3), label("Sex")) GUIDE: axis(dim(2), label("Waist")) GUIDE: legend(aesthetic(aesthetic.color), label("age_group")) SCALE: cat(dim(3), include("1", "2")) SCALE: linear(dim(2), include(0)) ELEMENT: schema(position(bin.quantile.letter(age_group*waist*sex)), color(age_group), label(id)) END GPL. And so on … 198 39BSPSS-Syntax Now the cursor is located at the first of the commands, and we choose the menu item Run-To End. All box plots are generated. Assume we notice an outlier (id 4) with implausible values for height and bmi. This outlier can be removed and the analysis is simply repeated by again locating the cursor at the first IGRAPH command and choosing RunTo End. The Session journal The syntax of all commands that SPSS executes is collected in a file called the Session journal. This journal can be used to learn SPSS syntax language or to save analyses for later reference. The SPSS session journal is saved somewhere on the hard disk, depending on the installation. The folder where it is saved can be queried by choosing the menu item Edit-Options and selecting the File Locations view: 199 39BSPSS-Syntax Here, the location of the Session Journal can even be changed by choosing a different folder after clicking on Browse.... By default, new commands are appended to the existing journal. Generating a new Session Journal each time SPSS is invoked could be useful under certain circumstances. The Session Journal file can be opened using the Notepad editor (or the SPSS Syntax editor). Here we can also select and copy commands to use them in other SPSS programs. 200 40BExact tests G. Exact tests Recall example 5.2.1.: Comparison of therapies of Dr. X: standard therapy will be compared with new therapy, the outcome is binary (cured versus not cured). The research question is: Is there a difference between the cure rates between both therapies? cured yes no standard therapy 4 12 16 new therapy 9 9 18 13 21 34 A chi-square test was calculated: Chi-Square Tests Value df Asymp. Sig. (2-sided) 2,242b 1 ,134 Continuity Correctiona 1,308 1 ,253 Likelihood Ratio 2,286 1 ,131 Pearson Chi-Square Fisher's Exact Test Exact Sig. (2-sided) ,172 Linear-by-Linear Association 2,176 N of Valid Cases 34 1 Exact Sig. (1-sided) ,126 ,140 a Computed only for a 2x2 table b 0 cells (,0%) have expected count less than 5. The minimum expected count is 6,12. 201 40BExact tests We already know that under the null hypothesis the Pearson Chi-Square criterion for a 2×2 crosstable follows a chi-square distribution with one degree of freedom (“df”). For the observed value of the test statistic of 2.242 the p-value calculates to 0.134. When reading the output table, two issues arise. One of them concerns the terms „Asymptotical Significance” and “Exact Significance” and the other concerns the remark “0 cells (.0%) have expected count less than 5. The minimum expected count is 6.12.”. These will be investigated in greater detail now. Remember the expected number of patients in case of validity of the null hypothesis: expected, if null hypothesis applies cured yes no standard therapy 6.1 9.9 16 new therapy 6.9 11.1 18 13 21 34 202 40BExact tests If we only knew the marginal sums for example 5.2.1, then the following 14 different 2×2 crosstables were possible: 0 13 16 5 16 18 7 6 9 12 16 18 13 21 34 13 21 34 1 12 15 6 16 18 8 5 8 13 16 18 13 21 34 13 21 34 2 11 14 7 16 18 9 4 7 14 16 18 13 21 34 13 21 34 3 10 13 8 16 18 10 3 6 15 16 18 13 21 34 13 21 34 4 9 12 9 16 18 11 2 5 16 16 18 13 21 34 13 21 34 5 8 11 10 16 18 12 1 4 17 16 18 13 21 34 13 21 34 6 7 10 11 16 18 13 0 3 18 16 18 13 21 34 13 21 34 Notice that it is sufficient to know one field of this 2×2 crosstable in case of fixed margins. We choose the upper left field (this corresponds to the number of cured patients under standard therapy). 203 40BExact tests Assume, the null-hypohtesis is true (=both therapies work equally well), then we could calculate the probability to observe any potential cross table just by chance. For those who are interested: the probabilities are calculated using the hypergeometric distribution. X 0 1 2 3 4 5 6 7 8 9 10 11 12 13 Pearsons Chi-square criteria 18.7 13.1 8.5 4.9 2.2 0.62 0.0069 0.39 1.8 4.2 7.5 11.9 17.3 23.7 probability (under the assumptions of the null hypothesis) 0.0000092 0.0003201 0.0041152 0.0264062 0.0953555 0.2059680 0.2746240 0.2288533 0.1188277 0.0377231 0.0070416 0.0007202 0.0000353 0.0000006 X denotes the value of the upper left cell of the 2×2 cross table. We see that X=5,6,7 are most likely. This is not surprising, as these cross tables correspond much with the expected cross table under the null hypothesis. Pearsons Chi-square criteria for every potential cross table is given in the middle column. For X=6 the smallest value results, for X=7 the second smallest value result and so on (usw.). In example 5.2.1 we have observed a cross table with X=4. We can now use the probability distribution shown above to determine an exact pvalue. For this we have to add the probabilities of all cross tables whose Chi-square criteria are equal to or greater than 2.2 (shaded in grey). We obtain an exact p-value of 0.1717. This procedure is also known as Fisher’s exact test. The SPSS-Output from example 5.2.1. confirms our calculation. 204 40BExact tests Remark: The SPSS-Output gives a p-value, labelled „Exact Sig. (1-sided)“, of 0.126 for the Fisher’s exact test for example 5.2.1. From our calculations above we can understand how SPSS calculates this value: the probabilities of the table of page 204 were summed from X=0 to 4. Note, the one-sided alternative hypothesis is generated and tested automatically based on the observed data. This is an impure scientific procedure. So, one-sided p-values shouldn’t be presented automatically in this way. 0.30 Probability function 0.25 0.20 0.15 0.10 0.05 0.00 0 2 4 6 8 10 12 14 Chi-Square criterion 16 18 20 22 24 Graphical presentation of the probability function of the Chi-square criteria from example 5.2.1 under the assumption that the null hypothesis is true. It can be argued why we obtain an asymptotical p-value of 0.134 for the Chi-square test, if the exact p-value is 0.1717 for example 5.2.1? For the time being: The asymptotical p-value (which is based on the Chi-square distribution) is an approximation of the exact p-value. This approximation becomes better (i. e., more precise) with increasing sample size, like most of the approximations in statistics. 205 40BExact tests 1.0 Distribution function 0.8 0.6 0.4 0.2 0.0 0 2 4 6 8 10 12 14 Chi-Square criterion 16 18 20 22 24 Graphical presentation of the asymptotic (dashed line) and exact (solid line) cumulative density functions of the Chi-square criteria for example 5.2.1 under the assumption that the null hypothesis is true. This approximation of the exact distribution by the asymptotic distribution seems to be satisfying. Nevertheless, why do we use the asymptotic pvalue if the exact p-value is available? Answer: Though the term "exact" gives the impression, that it is the better (i. e., the more precise) version of the test, we have to take two arguments into account. First, exact tests require in general special computer programs and powerful computers. Second, exact tests are conservative due to the discrete test statistic. In its statistical meaning, a conservative test adheres to the null hypothesis in too many situations. Or in other words, the accepted error probability (the significance level) is not completely used. 206 40BExact tests A rule of thumb for the decision between asymptotic and exact versions of the Chi-square test is • If all expected cell frequencies (under the null hypothesis) are higher than 5, then the asymptotic p-value can be used. Otherwise, use the exact p-value. For example 5.2.1. the smallest expected cell frequency is 6.1. Sometimes another rule of thumb is used as additional step: • If the sample size is smaller than 60, then use the asymptotic p-value corrected by Yates. This version is also named continuity-corrected Chi-square test. In the SPSS output it can be found in the row ”Continuity Correction“. For example 5.2.1. a continuity-corrected p-value of 0.253 is calculated. Are there other exact tests? Yes. In theory, exact p-values could be calculated for any statistical test. However, in practice only non-parametric tests are calculated in an exact version, e.g. the already known Wilcoxon-Mann-Whintey U Test. The procedure is similar to the algorithm shown for example 5.2.1. However, for problems exceeding the a test for a 2x2 table, the required computing effort can be considerable. Nowadays, more and more software packages offer exact tests (like SPSS, SAS, …). However, for many problems one has to switch over to other special software. For very small sample sizes ⇒ use exact instead of asymptotic p-values!! 207 41BEquivalence trials H. Equivalence trials Most of the clinical studies are designed to detect differences between two treatments. Sometimes one would like to show that a certain therapy (NT) is equal to the standard therapy (ST). Such studies are called equivalence trials. Reasons for such studies could be that NT has less side effects, is cheaper or easier to apply. Equivalence trials are also called similarity trials. This second expression refers to the circumstance that exact similarity of two treatments can never be shown, even if they are equivalent in reality. What we can show is “sufficient similarity” of the treatments. First, we have to consider what „sufficient similarity“ means. It doesn’t bother us if NT works better than ST. If NT=ST then everything is alright. Even if NT is slightly worse than ST we find NT acceptable. Thus we have the situation of a one-sided hypothesis. But, what means „slightly worse“? To define it, we need clinical considerations which difference is still acceptable. This smallest acceptable difference (derived from clinical rationale) is abbreviated by Θ0. Example: Comparison of cure rates, Θ0 =0.03 Null hypothesis: ST is at least 3 per cent points better than NT, ST-NT≥0.03, (no equivalence – it is again the negation of the research question!) Alternative hypothesis: ST is either worse, equal or maximal 3 percent points better than NT ST-NT<0.03 (equivalence) Solution: one-sided test based on a significance level of 5 % or one-sided 95 % confidence interval or two-sided 90 % confidence interval Note, Type 1 error: Falsely claim equivalence although the null hypothesis is true. Type 2 error: Falsely claim no equivalence although the alternative hypothesis is true. 208 41BEquivalence trials If we want to show that the effects of two therapies are not too different in both directions, we have the situation of a two-sided research question. First, we have to define two smallest acceptable differences (one for each deviation in both directions - above and below). For this we define two one-sided null hypotheses: E.g. comparison of cure rates, Θ01 =0.03, Θ02 =0.07 Null hypothesis 1: ST is at least 3 per cent points better than NT ST-NT≥0.03 (no equivalence) Null hypothesis 2: NT is at least 7 per cent points better than ST NT-ST≥0.07 (no equivalence) Alternative hypothesis: ST is maximal 3 per cent points better than NT, equaly well or maximal 7 per cent points worse than NT -0.07<ST-NT<0.03 (equivalence) Only when both one-sided null hypothesis are rejected based on the 5 % significance level we can assume equivalence. Two-sided hypotheses occur primarily for bioequivalence/bioavailability trials with pharmacokinetic or pharmacodynamic questions behind. Remark: One can also make sample size calculations for equivalence trials. The “better” sample size calculation programs offer separate menu items for that. 209 42BDescribing statistical methods for medical publications I. Describing statistical methods for medical publications If statistical methods have been used in a medical study then they must be described adequately in the resulting research paper. The “statistical methods” section is usually positioned at the end of the “material and methods” chapter (before the “results” chapter). The following principles should be observed: • On the one hand: descriptions of the statistical principles should be short and precise, as the medical aspects are in the foreground in medical manuscripts. • On the other hand: The description should be detailed enough. In other words: following the description of the statistical methods section and using the same set of data, all results should be reproducible by an independent researcher. • Empirical results do not belong to the statistical methods section. How should a statistical methods section be organized? • Description of any descriptive measures, data transformations, etc. that were used. • Description of any statistical tests, statistical models, methods for multiplicity adjustments, that were applied. • Description of the software used. • Description of the significance level used and the type of alternative hypotheses tested (one- or two-sided). Example (statistical methods section for the rat diet example 4.1.1): Weight gain in both groups was described by mean and standard deviation. Differences between both groups were assessed by unpaired t-test. SPSS statistical software system (SPSS Inc., Chicago, IL) was used for statistical calculations. The reported p-value is a result of a two-sided test. A p-value smaller or equal to 5 % is considered statistically significant. 210 43BDictionary: English-German J. Dictionary: English-German English German Remarks adjusted R-squared measure korrigiertes R-Quadrat See also: coefficient of determination alternative hypothesis Alternativhypothese analysis of variance Varianzanalyse Abbr.: ANOVA ANOVA ANOVA Abbr. for: ANalysis Of VAriance arithmetic mean arithmetisches Mittel coefficient of determination Bestimmtheitsmaß, R2, R-Quadrat confidence interval Konfidenzintervall correlation Korrelation degree of freedom Freiheitsgrad dependent variable abhängige Variable distribution Verteilung estimation Schätzung exact test exakter Test geometric mean geometrisches Mittel independent unabhängig independent variable unabhängige Variable interaction Wechselwirkung leverage point einflussreiche Beobachtung (einflussreicher Punkt, Hebelwert) linear regression lineare Regression See also: R-squared measure Abbr.: df Clearly, we intend the statistical meaning (not the colloquial) here 211 Such an observation has a potentially large „leverage effect“ on the regression line 43BDictionary: English-German logarithm Logarithmus mean Mittelwert median Median multiple comparison problem Multiplizitätsproblem multiple testing problem Multiplizitätsproblem nonparametric test nichtparametrischer Test null hypothesis Nullhypothese one-sided test, one-tailed test einseitiger Test one-way ANOVA einfaktorielle Varianzanalyse outcome, outcome variable Zielgröße outlier Ausreißer paired test gepaarter Test percent Prozent percentage point Prozentpunkt population Grundgesamtheit power Mächtigkeit probability Wahrscheinlichkeit prognostic factor Prognosefaktor p-value p-Wert quartile Quartile random experiment Zufallsexperiment range Spannweite regression model Regressionsmodell 212 43BDictionary: English-German residual Residuum response variable Zielgröße R-squared measure Bestimmtheitsmaß, R2, R-Quadrat sample Stichprobe sample size Fallzahl, Stichprobengröße sensitivity Sensitivität sign test Vorzeichentest significance level Signifikanzniveau skewed distribution schiefe Verteilung smallest difference important to detect minimal klinisch relevante Alternative specificity Spezifizität standard deviation Standardabweichung standard error Standardfehler statistically significant statistisch signifikant summary measure problemorientierter Parameter test Test t-test t-Test also: Student's t-test ("Student" was the pseudonym of W.S. Gosset, the inventor of the t-test) two related samples zwei verbundene Stichproben SPSS-Jargon two-sided test, two-tailed test zweiseitiger Test unpaired test ungepaarter Test 213 Abbr.: R2 The standard error is the standard deviation of the mean 43BDictionary: English-German variable Variable variable transformation Transformation einer Variablen variance Varianz Wilcoxon rank sum test Wilcoxon Rangsummentest Wilcoxon signedrank test Wilcoxon Vorzeichen-Rangtest 214 43BDictionary: English-German References Citations in bold typeface are introductory books in agreement with the contents of this lecture notes which may serve as supplemental material to the student. Other citations are either interesting further reading, specialized statistics books or sources of examples used in these lecture notes. • D.G. Altman (1991): Practical Statistics for Medical Research. Chapman and Hall, London, UK. • D.G. Altman (1992): Practical statistics for medical research. Chapman and Hall. • D. F. Andrews and A. M. Herzberg (1985): Data - A Collection of Problems from many Fields for the Student and Research Worker. Wiley, New York. • Ralf Bender, Stefan Lange (2001): Adjusting for multiple testing— when and how? Journal of Clinical Epidemiology 54(4), 343–349. • M. Bland (1995): An Introduction to Medical Statistics. Second Edition. Oxford University Press. • J.M. Bland and D.G. Altman (1986): Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1:307-310. • A. Bühl (2006): SPSS Version 14. Einführung in die moderne Datenanalyse. 10. überarbeitete Auflage. Pearson Studium (German). • M J. Campbell and D. Machin (1993): Medical Statistics - A Commonsense Approach. John Wiley & Sons, New York. • B. Dawson and R.G. Trapp (2004): Basic & Clinical Biostatistics. Fourth Edition. McGraw-Hill. • A.R. Feinstein, D.M. Sosin and C.K. Wells (1985): The Will Rogers phenomenon. Stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. The New England Journal of Medicine, 312(25), 1604-1608. • L.D. Fisher and G. van Belle (1993): Biostatistics - Methodology for the Health Sciences. Wiley, New York. • R.H. Fletcher, S.W. Fletcher (2005): Clinical Epidemiology. The Essentials. Fourth Edition. Lippincott Williams & Wilkins. (A german version of the book appeared in 2007 as "Klinische Epidemiologie: Grundlagen und Anwendung. 2. Auflage, Huber, Bern".) • R.J. Freund and P.D. Minton (1979): Regression Methods. A Tool for Data Analysis. Marcel Dekker. 215 43BDictionary: English-German • G. Gigerenzer (2002): Das Einmaleins der Skepsis. Über den richtigen Umgang mit Zahlen und Risiken. Berlin Verlag (German). • I. Guggenmoos-Holzmann und K.-D. Wernecke (1995): Medizinische Statistik. Blackwell Wiss.-Verlag (German). • F.E. Harrell and C.E. Davis (1982): A new distribution-free quantile estimator. Biometrika, 69:635-640. • R.-D. Hilgers, P. Bauer und V. Scheiber (2003): Einführung in die Medizinische Statistik. Springer-Verlag (German). • S. Holm (1979): A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6, 65-70. • J.C. Hsu (1996): Chapman and Hall. • M.H. Katz (1999): Multivariable Analysis. A Practical Guide for Clinicians. Cambridge University Press. • D. Kendrick, K. Fielding, E. Bentley, R. Kerslake, P. Miller, and M. Pringle. Radiography of the lumbar spine in primary care patients with low back pain: randomised controlled trial. British Medical Journal, 322:400-405, 2001. • S. Landau and B.S. Everitt (2004): A Handbook of Statistical Analyses using SPSS. Chapman & Hall/CRC. • P. Little, J. Barnett, L. Barnsley, J. Marjoram, A. Fitzgerald-Barron, and D. Mant. Comparison of agreement between different measures of blood pressure in primary care and daytime ambulatory blood pressure. British Journal of Medicine, 325:254-257, 2002 • R. J. Lorenz. Grundbegriffe der Biometrie. G.Fischer, Stuttgart, 1988 (German). • D. Machin, M. Campbell, P. Fayers and A. Pinol (1997): Sample Size Tables for Clinical Studies. 2nd Edition. Blackwell Science. • D.E. Matthews and V.T. Farewell (1988): Using and Understanding Medical Statistics. 2nd, revised edition. Karger. • R. Matthews (Translated by J. Engel): Der Storch bringt die Babies zur Welt (p=0.008). Stochastik in der Schule 21: 21-23, 2001 • H. Motulsky (1995): Intuitive Biostatistics. Oxford University Press. • J. Pallant (2005): SPSS survival manual. 2nd edition. Open University Press. • G. v. Randow (1992): Das Ziegenproblem. Denken in Wahrscheinlichkeiten. Rowohlt Verlag (German). Multiple Comparisons. 216 Theory and methods. 43BDictionary: English-German • J. Reisinger, E. Gatterer, G. Heinze, K. Wiesinger, E. Zeindlhofer, M. Gattermeier, G. Poelzl, H. Kratzer, A. Ebner, W. Hohenwallner, K. Lenz, J. Slany, and P. Kuhn. Prospective comparison of flecainide versus sotalol for immediate cardioversion of atrial fibrillation. Am J Cardiol, 81:1450-1454, 1998. • M.F. Schilling, A.E. Watkins, and W. Watkins. Is human height bimodal? The American Statistician, 56:223-229, 2002. • M. Schumacher und G. Schulgen (2002): Methodik klinischer Studien. Methodische Grundlagen der Planung, Durchführung und Auswertung. Springer-Verlag (German). • C. Schwarz, C. Mitterbauer, M. Boczula, T. Maca, M. Funovics, G. Heinze, M. Lorenz, J. Kovarik, and R. Oberbauer. Flow monitoring: Performance characteristics of ultrasound dilution versus color doppler ultrasound compared with fistulography. American Journal of Kidney Diseases, 42:539-545, 2003.J. Bakker, M. Olree, R. Kaatee, E. E. de Lange, K. G. M. Moons, J. J. Beutler, and F. J. A. Beek. Renal volume measurements: Accuracy and repeatability of us compared with that of mr imaging. Radiology, 211:623-628, 1999. • J.P. Shaffer (1986): Modified Sequentially Rejective Multiple Test Procedures. Journal of the American Statistical Association, 81(395), 826-831. • W. Timischl. Biostatistik. Springer, Wien, 2000 (German). 217