Basic Biostatistics Workshop Using SPSS 2019
Lesson 1: Descriptive Statistics
Dr. Mohamad Rodi Isa
MBBS (Malaya), DAP&E (SEAMEO-TROPMED, M’sia), MPH (Malaya), DrPH (Malaya)
Public Health Medicine, UiTM

Contents
Recap: Introduction to statistics.
Lesson 1: Descriptive Statistics.
Lesson 2: Inferential Statistics: Estimation & Hypothesis Testing.
Lesson 3: Analyzing Quantitative Data: T-test.
Lesson 4: Analyzing Quantitative Data: Analysis of Variance (ANOVA).
Lesson 5: Analyzing Categorical Data.
Lesson 6: Correlation Analyses.
Lesson 7: Simple Linear Regression.
Lesson 8: Simple Logistic Regression.
Lesson 9: Non-Parametric Method Data Analyses.

Descriptive Statistics
• Descriptive statistics are numerical values obtained from the sample that give meaning to the data collected.
• They consist of methods for organizing, displaying and describing data using tables, graphs and summary measures.
• Descriptive statistics consist of:
i. Frequency and proportion.
ii. Measures of central tendency.
iii. Measures of dispersion.
• Two important concepts for understanding descriptive statistics are:
i. Variables.
ii. Distribution.

Descriptive Statistics
• Involves:
- Collecting data.
- Presenting data.
- Characterizing data.
• Purpose: to describe data.
• Descriptive statistics simplify, summarize and describe data.

Types of variables
• Categorical:
i. Nominal.
ii. Ordinal.
• Numerical:
i. Discrete.
ii. Continuous: interval or ratio.

Categorical Data Analysis
Presentation:
• Percentage.
• Frequency.
• Relative frequency.
• Cumulative relative frequency.
• Proportion.
Graphical:
• Bar chart.
• Pie chart.
• Pareto diagram.

i. Frequency
Question: How do we get the frequency for “race” in SPSS?
Analyze >> Descriptive Statistics >> Frequencies

Hands-on: Percentage
Describe data for race of respondents.
[SPSS dialog screenshot, steps 1 to 5]

Output: Percentage
[SPSS output screenshot]

ii.
Cross-tabulation
• Cross-tabulation is a method to quantitatively analyze the relationship between multiple variables.
• It is also known as a contingency table; it groups variables to show the association between them.
• It is usually performed on categorical data, that is, data that can be divided into mutually exclusive groups.
Example:
• How to cross-tabulate gender and ethnicity.
Analyze >> Descriptive Statistics >> Crosstabs

Hands-on: Cross-tabulation
Describe data for race of respondents stratified by gender.
[SPSS dialog screenshots, steps 1 to 9. At step 7 you can change to “ROW” percentages.]

Output: Percentage
1) Percentage by ROW
2) Percentage by COLUMN
[SPSS output screenshots]

Hands-on: Simple Bar Chart
Describe data for race of respondents.
[SPSS dialog screenshots, steps 1 to 9]

Output: Simple Bar Chart
[SPSS output screenshot]

Hands-on: Clustered Bar Chart
Describe data for race of respondents stratified by gender.
[SPSS dialog screenshots, steps 1 to 10]

Output: Clustered Bar Chart
[SPSS output screenshot]

Hands-on: Pie Chart
Describe data for race of respondents.
[SPSS dialog screenshots, steps 1 to 8]

Output: Pie Chart
[SPSS output screenshot]

Exercise:
• Describe data for:
i. Gender.
ii. Educational level.
iii. Medication type.
iv. Heart disease status.

Numerical Data Analysis
1) Presentation:
• Measures of central tendency.
• Measures of dispersion (variability).
2) Graphical:
• Histogram.
• Stem-and-leaf plot.
• Box-and-whisker plot.
• Many more.

Explore
• It is the first step in the analytic process:
i. to explore the characteristics of the data.
ii. to screen for errors and correct them.
iii. to look for distribution patterns (normally distributed or not).
• The data may require transformation before further analysis using parametric methods, or may need analysis using a non-parametric technique.

Hands-on: Explore
Describe data for weight of respondents.
[SPSS dialog screenshot, steps 1 to 5]

Output: Explore
[SPSS output screenshot]

Hands-on: Explore (stratify)
Describe data for weight of respondents stratified by gender.
[SPSS dialog screenshot, steps 1 to 6]

Output: Explore (stratify)
• Mean weight for male: 69.55 (SD: 2.67)
• Mean weight for female: 70.07 (SD: 2.66)

Hands-on: Histogram
Describe data for weight of respondents.
[SPSS dialog screenshot, steps 1 to 6]

Output: Histogram
[SPSS output screenshot: histogram with curve]

Exercise:
Explore:
i. Weight.
ii. Age.
iii. Height.
iv. Body mass index.

Answer: Summary
Variable | Mean (SD) | Median (IQR)
Weight | 69.81 (2.66) | -
Age | - | 72.50 (14)*
Height | - | 1.47 (0.12)**
BMI | 31.20 (2.85) | -
* Data skewed to the left
** Data skewed to the right

Scales of measurement
Categorical (qualitative) variables:
• Nominal: the assignment of numbers for classification purposes. E.g. gender, blood group.
• Ordinal: values providing a classification according to order or magnitude. E.g. educational status.
Numerical (quantitative) variables:
• Discrete: countable values, where only certain values are possible with no intermediate values (whole numbers only). E.g. number of children (1, 2, 3, 4, ...).
• Continuous:
- Interval: classification along a continuum with equal intervals that can be sensibly subdivided. E.g. temperature.
- Ratio: interval data with an absolute zero. E.g. height, weight.

Thank you for your attention
rodi@salam.uitm.edu.my

Basic Biostatistics Workshop Using SPSS 2019
Lesson 2: Inferential Statistics: Estimation & Hypothesis Testing

Contents
Recap: Introduction to statistics.
Lesson 1: Descriptive Statistics.
Lesson 2: Inferential Statistics: Estimation & Hypothesis Testing.
Lesson 3: Analyzing Quantitative Data: T-test.
Lesson 4: Analyzing Quantitative Data: Analysis of Variance (ANOVA).
Lesson 5: Analyzing Categorical Data.
Lesson 6: Correlation Analyses.
Lesson 7: Simple Linear Regression.
Lesson 8: Simple Logistic Regression.
Lesson 9: Non-Parametric Method Data Analyses.

Inferential Statistics
• The application of analytic procedures to a sample of the population in order to infer (generalize) the results obtained to the target population.
• It is a set of techniques where:
i. inferences about the population parameter are drawn from the sample statistics; OR
ii. the sample statistics observed are inferred to the corresponding population parameters.
• The analysis infers properties of a population. This includes:
i. Estimation / confidence intervals:
- Point estimation (the most likely value of the parameter).
- Interval estimation (also called a confidence interval for the parameter).
ii. Hypothesis testing (tests of significance).

Inferential Statistics
• Random sampling: every member of the population has the same chance of being selected in the sample.
• Target population (parameter) → random sampling → observed sample (statistics) → estimation (inference about the population).

Descriptive versus inferential statistics
• Population → sampling → sample.
• Descriptive: describe the characteristics of the sample, e.g.:
i. Sex: % male, % female.
ii. Age: mean (SD).
iii. Race: % Malay, % Chinese, % Indian.
• Inferential: inference about the population.

GOLDEN RULE
It is never about the sample.
It is ALWAYS about the population.

1. Estimation & Confidence Intervals
• Estimation refers to the process by which one makes inferences about a population based on information obtained from a sample.
• It uses sample statistics to estimate the population parameter.
• Examples:
i.
sample means are used to estimate population means.
ii. sample proportions are used to estimate population proportions.
• An estimate of a population parameter is presented in two ways:
i. Point estimate: a single value of a statistic.
ii. Interval estimate: an interval defined by two numbers, between which the population parameter is said to lie.

How to calculate Confidence Intervals
• The calculation of confidence intervals is based on:
i. The standard deviation of the population: known or unknown.
ii. The sample size: more or less than 30.
iii. The level of confidence: set by the researcher.
• Items (i) and (ii) determine whether to use:
i. the t-score; or
ii. the z-score.

t-score versus z-score
How do we calculate the range of plausible values for the BMI and the prevalence of obesity in the whole Shah Alam population?
Do you know the POPULATION STANDARD DEVIATION, σ?
• No → use the t-score.
• Yes → is the sample size above 30?
- Yes → use the z-score.
- No → use the t-score.

t-score versus z-score
1) When σ is UNKNOWN, OR σ is KNOWN but the sample is < 30 (using the t-score):
• The critical value t(α/2) is determined from the t-table based on:
i. Level of significance.
ii. Degrees of freedom.
iii. One-sided or two-sided test.
2) When σ is KNOWN and the sample is > 30 (using the z-score):
• The confidence coefficient (z) is determined from the standard normal distribution table based on the degree of confidence.
• When the degree of confidence is:
i. 90%, z is 1.64.
ii. 95%, z is 1.96.
iii. 99%, z is 2.58.
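The confidence coefficients listed above can be reproduced from the standard normal distribution instead of a printed table. The following is a minimal Python sketch, not part of the SPSS workflow: the function names `z_coefficient` and `z_interval` are hypothetical, and the sample values (mean 70.0, σ = 3.0, n = 100) are invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_coefficient(confidence):
    """Two-sided confidence coefficient z for a confidence level such as 0.95."""
    tail = (1 - confidence) / 2          # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)


def z_interval(point_estimate, standard_error, confidence=0.95):
    """Point estimate +/- (confidence coefficient * SE)."""
    margin = z_coefficient(confidence) * standard_error
    return point_estimate - margin, point_estimate + margin


# Reproduce the table values for 90%, 95% and 99% confidence.
z_values = {level: round(z_coefficient(level), 2) for level in (0.90, 0.95, 0.99)}
print(z_values)

# Hypothetical example: sample mean 70.0, known sigma 3.0, n = 100.
lower, upper = z_interval(70.0, 3.0 / sqrt(100))
print(round(lower, 2), round(upper, 2))
```

When σ is unknown or the sample is small, the same structure applies with a t critical value in place of z; that value depends on the degrees of freedom and is not computed here.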
t-score versus z-score
Formula | σ UNKNOWN, OR σ KNOWN but sample < 30 | σ KNOWN and sample > 30
General formula | Point estimate ± (critical value × SE) | Point estimate ± (confidence coefficient × SE)
Mean interval | Point estimate ± [ t(α/2) × (SD / √n) ] | Point estimate ± [ z × (SD / √n) ]
Proportion interval | Point estimate ± [ t(α/2) × √( p(1 − p) / n ) ] | Point estimate ± [ z × √( p(1 − p) / n ) ]

Estimation & Confidence Intervals
Research question:
• What is the mean (with 95% confidence interval) of Body Mass Index (BMI) and the prevalence of Tuberculosis in Shah Alam?
i. Body Mass Index: estimate the mean and its 95% confidence interval.
ii. Tuberculosis: estimate the proportion and its 95% confidence interval.

Hands-on: Estimation (Mean)
Calculate the mean of Body Mass Index and its 95% CI.
[SPSS dialog screenshot, steps 1 to 5]

Output: Estimation (Mean)
• The mean BMI: 31.20 (95% CI: 30.64, 31.75)

Hands-on: Estimation (Mean)
Calculate the mean of Body Mass Index and its 95% CI stratified by gender.
[SPSS dialog screenshot, steps 1 to 6]

Output: Estimation (Mean)
• The mean BMI for male: 30.61 (95% CI: 29.79, 31.43)
• The mean BMI for female: 31.80 (95% CI: 31.09, 32.52)

Hands-on: Estimation (Proportion)
Calculate the prevalence of Tuberculosis.
[SPSS dialog screenshot, steps 1 to 5]
Tip: the coding must be:
• 0 for No
• 1 for Yes

Output: Estimation (Proportion)
• The prevalence of Tuberculosis: 43.4% (95% CI: 33.8, 53.0)

Hands-on: Estimation (Proportion)
Calculate the prevalence of Tuberculosis stratified by gender.
[SPSS dialog screenshot, steps 1 to 6]
Tip: the coding must be:
• 0 for No
• 1 for Yes

Output: Estimation (Proportion)
• The prevalence of Tuberculosis for male: 31.5% (95% CI: 18.7, 44.3)
• The prevalence of Tuberculosis for female: 55.8% (95% CI: 41.8, 69.7)

2. Hypothesis testing
• Hypothesis testing:
- It expresses how accurately the sample results represent the true situation in the population.
• Purpose:
- To aid the researcher in reaching a conclusion about the population by examining a sample from that population.
• This is how we decide whether:
- An effect actually occurred.
- A treatment has an effect.
- Groups differ from each other.
- One variable predicts another.

Types of Hypothesis
1) Null hypothesis (Ho)
• Hypothesis of no difference.
• No difference or relationship between the variables of interest.
2) Alternate hypothesis (Ha)
• Hypothesis that contradicts the null hypothesis.
• Can indicate the direction of the difference or relationship.
• Also called the research hypothesis.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value and compare it with α:
- p > α → do not reject Ho → no significant difference.
- p < α → reject Ho → significant difference.
• Step 6: Conclusions (statistical and problem).
• Step 7: Interpret and give the overall conclusion.

Thank you for your attention
rodi@salam.uitm.edu.my

Basic Biostatistics Workshop Using SPSS 2019
Lesson 3: Analyzing Quantitative Data: T-test

Student t-test
• A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution.
• The t-distribution is one of the probability distributions used in statistics when dealing with continuous random variables.
• It is used for hypothesis testing involving numerical data (comparing means).
• It is very similar to the standard normal distribution (z-distribution).

Types of t-test
• Comparing the mean of one group: one-sample t-test.
• Comparing the means of a paired group: dependent t-test.
• Comparing the means of two independent groups: independent t-test.

3.1 One sample t-test
• It helps to determine whether μ (the population mean) is equal to a hypothesized value (the test mean).
• The test uses the standard deviation of the sample (s) to estimate the standard deviation of the population (σ).
• The hypotheses are specified about a single distribution.

Assumptions
i. The outcome must be a continuous variable (numerical variable, i.e. interval or ratio).
ii. The sample is selected by random sampling: each individual in the population has an equal probability of being selected in the sample.
iii. The tested data are normally distributed (with no outliers) OR the sample size is big (≥ 30).
iv. Scores on the test variable are independent (i.e. independence of observations).

One sample t-test
Research question:
• You want to determine whether the weight of the respondents is statistically different from 70 kg.
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: μ is equal to 70 kg.
b) Alternate hypothesis:
• Ha: μ is NOT equal to 70 kg (or is far from 70 kg).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).

One sample t-test
• Step 3: Checking assumptions.
i. The data are a continuous variable (numerical variable, i.e. interval or ratio).
ii. The sample is selected by random sampling: each individual in the population has an equal probability of being selected in the sample.
iii. The tested data are normally distributed (with no outliers) OR the sample size is big (≥ 30).
iv. Scores on the test variable are independent (i.e. independence of observations).
• Step 4: Choose the test statistic.
- One sample t-test with (n − 1) df.

One sample t-test
• Step 5: Find the p-value.
- Calculate t-calc.
• Formula: t = (X̄ − μ) / (s / √n)
- X̄: mean from our sample.
- μ: our hypothesized mean.
- s: standard deviation from our sample.
- n: size of our sample.

Hands-on: One sample t-test
[SPSS dialog screenshot, steps 1 to 6]

Output: One sample t-test
• The p-value is 0.455, which is more than alpha (0.05).

Conclusion
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p = 0.455) is more than α (0.05), we DO NOT reject Ho.
• Therefore, we conclude that the mean weight of the respondents is not significantly different from 70 kg.
• Since the result is not statistically significant, the difference observed could be due to chance.
b) Problem conclusion:
• There is no evidence that the mean weight of the respondents differs from 70 kg (it is not far from 70 kg).

Output: Summary
Table: The comparison between weight and the test value of the weight (N=106)
Variable | N | Mean (SE) | Test value | t a (df) | Mean difference (95% CI) | p-value
Weight | 106 | 69.81 (0.26) | 70.00 | -0.750 (105) | -0.19 (-0.71, 0.32) | 0.455
a Statistical test: One sample t-test

3.2 Independent t-test
• It is the most commonly used method to evaluate the difference in means between two groups.
• The independent t-test compares the means of two unrelated groups (independent variable) on the same continuous dependent variable.
• Independent variable: Group 1 and Group 2. Dependent variable (continuous data): compare the means of the two groups.

Assumptions
i. The dependent variable is either interval or ratio.
ii. Random samples.
iii. The independent variable consists of two independent groups: each subject should appear in only one group and the groups are unrelated.
iv. The dependent variable is approximately normally distributed in each population OR the sample size is big (n ≥ 30).
v. The variances of the two groups must be equal (homogeneity of variances). This can be checked with Levene’s test.
Levene’s test:
i. When the test is significant (p < 0.05) → the variances are unequal → the equal-variances t-test is NOT valid.
ii. When the test is not significant (p > 0.05) → the variances are equal → the t-test is valid.

Independent t-test
Research question:
• Is there any difference in body mass index (BMI) between male and female?
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: There is no difference in the mean BMI between male and female (Ho: μmale = μfemale).
b) Alternate hypothesis:
• Ha: There is a difference in the mean BMI between male and female (Ha: μmale ≠ μfemale).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).

Independent t-test
Step 3: Checking assumptions.
• The dependent variable is either interval or ratio.
• Random samples.
• The independent variable consists of two independent groups: each subject should appear in only one group and the groups are unrelated.
• The dependent variable is approximately normally distributed in each population OR the sample size is big (n ≥ 30).
• Equal variances between the two groups (homogeneity of variances). This can be checked with Levene’s test.
Step 4: Choose the test statistic.
• Independent t-test with (n1 + n2 − 2) df.

Steps in hypothesis testing
Step 5: Find the p-value.
- Calculate t-calc.
Formula: t = (X̄1 − X̄2) / √[ sp² × (1/n1 + 1/n2) ], where sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2) is the pooled variance.

Hands-on: Independent t-test
[SPSS dialog screenshot, steps 1 to 10]

Output: Independent t-test
[SPSS output screenshot: descriptive statistics]
• Levene’s test → test for equality of variances (assumption).
• Since the p-value is not significant (p = 0.118), we can conclude that the variances of both groups are equal.
• Therefore, the t-test is valid.

Output: Independent t-test
• The p-value is 0.031, which is less than alpha (0.05).
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p = 0.031) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a significant difference in the mean BMI between male and female (p = 0.031).

Output: Independent t-test
b) Problem conclusion:
• There is a difference in BMI between male and female [mean difference: -1.19 (95% CI: -2.27, -0.11)].
• BMI for male is lower than BMI for female [30.61 (SE: 0.41) versus 31.80 (SE: 0.36)].
• We are 95% confident that the difference in BMI between male and female is in the range of 0.11 to 2.27 kg/m² (on average 1.19 kg/m²).
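As a cross-check on the output above, the pooled (equal-variances) t statistic can be recomputed from the reported summary statistics alone. The sketch below is illustrative Python, not the SPSS procedure: the function name `pooled_t_from_summary` is hypothetical, the group sizes (54 males, 52 females) are taken from the workshop's summary table, and because the means and SEs are rounded the result only approximates the reported t of -2.196.

```python
from math import sqrt


def pooled_t_from_summary(mean1, se1, n1, mean2, se2, n2):
    """Equal-variances t statistic rebuilt from group means, SEs and sizes."""
    # Recover each group's variance from its standard error: SD = SE * sqrt(n).
    var1 = (se1 ** 2) * n1
    var2 = (se2 ** 2) * n2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se_diff, diff / se_diff, df


# Male: 30.61 (SE 0.41), n = 54; Female: 31.80 (SE 0.36), n = 52.
diff, se_diff, t_stat, df = pooled_t_from_summary(30.61, 0.41, 54, 31.80, 0.36, 52)
print(f"mean difference = {diff:.2f}, t = {t_stat:.3f}, df = {df}")
```

The p-value and confidence interval additionally need the t-distribution with 104 degrees of freedom, which SPSS supplies; they are deliberately left out of this sketch.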
Output: Summary
Table: The comparison of body mass index between male and female (n=106)
Gender | N | BMI, Mean (SE) | t a (df) | Mean difference (95% CI) | p-value
Male | 54 | 30.61 (0.41) | -2.196 (104) | -1.19 (-2.27, -0.11) | 0.031*
Female | 52 | 31.80 (0.36) | | |
* statistically significant at α=0.05
a Statistical test: Independent t-test

3.3 Paired t-test
• It compares two paired observations from the same individual or from matched individuals.
• It is also known as a t-test for repeated measures or a t-test for matched samples.
• It is often used with a pre-test and post-test design and with data in pairs.
Examples:
i. Does the new drug decrease the patient’s blood pressure?
ii. Can the new medication reduce weight?
iii. Is there any difference in intraocular pressure (IOP) between the right and left eye?

Assumptions
i. The dependent variables must be numerical data.
ii. Random samples.
iii. The observations are paired or dependent.
iv. The difference between pairs (before and after) is normally distributed, unless the sample size is big enough (n ≥ 30).
- We need to calculate the difference between pairs.
- Then determine the distribution of the difference between pairs:
i. if the difference is normally distributed → the t-test is valid.
ii. if the difference is not normally distributed → the t-test is NOT valid.

Paired t-test
Research question:
• Is there any difference in systolic blood pressure before and after treatment?
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• The means of pre-treatment and post-treatment systolic blood pressure are the same (Ho: μdifference = 0).
b) Alternate hypothesis:
• The means of pre-treatment and post-treatment systolic blood pressure are NOT the same (Ha: μdifference ≠ 0).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).

Paired t-test
Step 3: Checking assumptions.
• The dependent variables must be numerical data.
• Random samples.
• The observations are paired or dependent.
• The difference between pairs (before and after) is normally distributed, unless the sample size is big enough (n ≥ 30).
Step 4: Test statistic.
• Paired t-test with (n − 1) df.

Paired t-test
Step 5: Find the p-value.
- Calculate t-calc.
Formula: t = d̄ / (s / √n)
- d̄: mean difference.
- s: sample standard deviation of the differences.
- n: sample size (number of pairs).
- t: a Student t with n − 1 degrees of freedom.

Hands-on: Checking the normality of the difference
[SPSS dialog screenshot, steps 1 to 5]
• A new variable, “Dif_SBP”, will be created at the right end of the data.

Hands-on: Checking the normality of the difference (creating histogram)
[SPSS dialog screenshot, steps 1 to 6]

Output: Paired t-test
• The data for “Dif_SBP” are normally distributed.
• Assumption fulfilled.
• The paired t-test is valid.

Hands-on: Paired t-test
[SPSS dialog screenshot, steps 1 to 5]

Output: Paired t-test
[SPSS output screenshot: descriptive statistics]
Step 6: Conclusion.
a) Statistical conclusion:
• Since the p-value (p < 0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that the means of pre-treatment and post-treatment systolic blood pressure are NOT the same (Ha: μdifference ≠ 0).

Output: Paired t-test
b) Problem conclusion:
• The mean difference in SBP before and after treatment is not 0 [mean paired difference: 8.89 (95% CI: 6.71, 11.06)].
• The difference is 8.89 mmHg (a true difference).
• The SBP before treatment is higher than the SBP after treatment [146 (SE: 1.24) versus 137 (SE: 0.72)].
• We are 95% confident that the difference in SBP before and after treatment is in the range of 6.71 to 11.06 mmHg (on average 8.89 mmHg).

Output: Summary
Table: The paired difference between SBP (before treatment) and SBP (after treatment) (N=106)
Variable | N | Before treatment, Mean (SE) | After treatment, Mean (SE) | Mean paired difference (95% CI) | t a (df) | p-value
SBP | 106 | 146.67 (1.23) | 137.98 (0.72) | 8.89 (6.71, 11.06) | 8.095 (105) | <0.001*
* statistically significant at α=0.05
a Statistical test: Paired t-test

Basic Biostatistics Workshop Using SPSS 2019
Lesson 4: Analyzing Quantitative Data: One-way Analysis of Variance (ANOVA)

ANOVA
• It is a technique for comparing means and is an extension of the t-test.
• It is useful for comparing (testing) two or more means (groups or variables) for statistical significance.
• It determines whether the means of ≥ 2 independent groups are significantly different from one another.
• There is only one independent variable (factor / grouping) with ≥ 2 groups.
i. Grouping variable → nominal.
ii. Outcome variable → interval or ratio.
One-way ANOVA
• Example groups: Uncontrolled DM, Controlled DM and Normal, each with its own mean age.
• Comparing the three groups pair by pair needs three independent t-tests, which increases the Type I (alpha) error: WRONGLY rejecting H0 when it is true.
• Instead, run the overall ANOVA test first. If it is SIGNIFICANT (p-value < 0.05), run a POST HOC TEST to find which pairs have significantly different means.

ANOVA
Sum of Squares (SS) & Mean Squares (MS):
• 2 possible sources of variation:
i. between the groups: the groups have different means that vary about the overall mean.
ii. within the groups: not all subjects within a group have exactly the same value.
• Types:
i. One-way ANOVA: only 1 independent variable.
ii. Two-way ANOVA: two independent variables.
Hypotheses:
• Ho: The population group means are equal to one another.
• Ha: At least one pair of group means is different.

Assumptions
i. Random samples measured on interval or ratio scales.
ii. The observations are independent of each other.
iii. In each independent group, the dependent variable is normally distributed.
iv. The variances of the groups are equal in the population (homogeneous).
- This is tested using Levene’s test (homogeneity of variances).
Note: The assumptions of the ANOVA test are similar to those of the independent t-test.

ANOVA
Research question:
• Is there any difference in weight between races (Malay, Chinese and Indian)?
Step 1: Specify the hypotheses.
a) Null hypothesis:
• There is no difference in mean weight between races (Ho: μmalay = μchinese = μindian).
b) Alternate hypothesis:
• At least one pair of races differs in mean weight.
Step 2: Choose the significance level.
• α = 0.05 (one-sided).

Steps of hypothesis testing
Step 3: Checking assumptions.
• Random samples measured on interval or ratio scales.
• The observations are independent of each other.
• In each independent group, the dependent variable is normally distributed.
• The variances of the groups are equal in the population (homogeneous).
Step 4: Choose the test statistic.
- ANOVA with (k − 1, n − k) df.
Step 5: Find the p-value.
- Calculate the F-ratio (F-calculation).

Steps in calculating the F-ratio
1. Calculate the grand mean (the overall mean).
2. Calculate the sums of squares:
2a - Sum of squares between groups (SSB).
2b - Sum of squares within groups (SSW).
2c - Total sum of squares (SST).
3. Calculate the degrees of freedom (df):
3a - Total.
3b - Within group.
3c - Between group.
4. Calculate the mean squares:
4a - Mean square between groups (MSB).
4b - Mean square within groups (MSW).
5. Calculate the F-ratio.

Calculate F-ratio (test statistic)
• The F-ratio is a ratio of two sample variances.
• The F-test statistic is found by dividing the between-group variance (MSB) by the within-group variance (MSW): F-ratio = MSB / MSW.
• The larger the differences between the means, the larger the treatment variance component and the larger the F.
• The F-ratio follows the F-distribution, which is a positively skewed distribution with only positive values.
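The five calculation steps above can be traced with a short script. This is a minimal Python sketch with a hypothetical function name (`one_way_anova`) and three small made-up groups chosen so the arithmetic is easy to follow by hand; it is not the workshop data and does not compute the p-value.

```python
def one_way_anova(groups):
    """Return the sums of squares, df, mean squares and F-ratio for k groups."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total sample size
    # Step 1: grand mean
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Step 2: sums of squares (between, within, total)
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    sst = ssb + ssw
    # Step 3: degrees of freedom
    df_between, df_within, df_total = k - 1, n - k, n - 1
    # Step 4: mean squares
    msb = ssb / df_between
    msw = ssw / df_within
    # Step 5: F-ratio
    return {"SSB": ssb, "SSW": ssw, "SST": sst,
            "df": (df_between, df_within, df_total),
            "MSB": msb, "MSW": msw, "F": msb / msw}


# Made-up example: group means are 2, 3 and 6; the grand mean is 33/9.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For these made-up groups SSB = 26 and SSW = 6, so MSB = 13, MSW = 1 and F = 13 with (2, 6) degrees of freedom.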
Source Table
Source of variation | SS | Degrees of freedom | MS | F-ratio
Between | SSB | k − 1 | MSB = SSB / (k − 1) | MSB / MSW
Within | SSW | n − k | MSW = SSW / (n − k) |
Total | SST | n − 1 | |

Hands-on: One-way ANOVA
[SPSS dialog screenshots, steps 1 to 10]

Output: One-way ANOVA
• The Test of Homogeneity of Variances tests whether the variances are the same (one of the assumptions).
• Since the p-value is more than 0.05 (p = 0.073), we assume that the variances are homogeneous.

Output: One-way ANOVA
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p < 0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that at least one pair of races differs in mean weight (p < 0.001).

Output: One-way ANOVA
b) Problem conclusion:
• At least one pair of races differs in weight. Which pair?
- Between Malay and Chinese?
- Between Malay and Indian?
- Between Chinese and Indian?
• This is answered by the post-hoc test.

Hands-on: One-way ANOVA (Post-hoc)
[SPSS dialog screenshot, steps 1 to 4]

Output: One-way ANOVA (Post-hoc)
Interpretations:
• There is a significant difference in mean weight between Malay and Chinese. The mean weight of Malay is significantly lower than the mean weight of Chinese (mean diff.: -3.05; 95% CI: -4.14, -1.96; p < 0.001).
• There is a significant difference in mean weight between Malay and Indian. The mean weight of Malay is significantly lower than the mean weight of Indian (mean diff.: -4.66; 95% CI: -5.90, -3.42; p < 0.001).
• There is a significant difference in mean weight between Chinese and Indian. The mean weight of Chinese is significantly lower than the mean weight of Indian (mean diff.: -1.61; 95% CI: -2.77, -0.45; p = 0.003).
Output: Summary
Table: Mean weight between race (N=106)
Variable | N | Mean Weight (SD) | F-statistics a (df) | P-value
Race: | | | 85.071 (2,27) | <0.001* b
Malay | 33 | 67.29 (1.82) | |
Chinese | 46 | 70.35 (2.24) | |
Indian | 27 | 71.96 (1.59) | |
* Statistically significant at α=0.05
a One-way ANOVA test
b Mean weight of "Malay" and "Chinese" (p<0.001), "Malay" and "Indian" (p<0.001), and "Chinese" and "Indian" (p=0.003) were significantly different by post-hoc test (Bonferroni's procedure).

Thank you for your attention
rodi@salam.uitm.edu.my

Basic Biostatistics Workshop Using SPSS 2019
Lesson 5 Analyzing Categorical Data
Dr. Mohamad Rodi Isa
MBBS (Malaya), DAP&E (SEAMEO-TROPMED, M’sia), MPH (Malaya), DrPH (Malaya)
Public Health Medicine, UiTM

Categorical Data Analyses
• Categorical Data Analysis (CDA) involves the analysis of data with a categorical response variable and is one of the nonparametric methods for analyzing data.
• The categorical data can be nominal or ordinal.
• The data in such a study represent counts or frequencies of observations in each category.
• It can be:
  i. estimating a single proportion; and
  ii. comparing two or more proportions.
Types of Categorical Data
Analysis of categorical data:
• One proportion:
  - Binomial test
  - Chi-square goodness-of-fit
• Two or more proportions:
  - Pearson Chi-square (Chi-square test for independence)
  - Chi-square test for homogeneity
  - Yates' correction
  - Fisher's Exact test
• Dependent group: McNemar's test
• Stratified sampling to control confounder effect: Mantel-Haenszel test

5.1.1 Binomial Test
• It is a test of one proportion.
• It is an exact test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories.
• Purpose: to test whether the proportion observed in a sample equals a standard or specified value.
Assumptions:
i. The outcomes can be categorized as binary data (Yes / No).
ii. The observations should be independent of each other.
iii. The sample size multiplied by the proportion in category A is more than 10 (np > 10), and the sample size multiplied by the proportion in category B is more than 10 (n(1 - p) > 10).

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.

Binomial Test
Research question:
• Is it true that the proportion of Tuberculosis in Shah Alam is 0.5?
• Our hypothesis: the proportion is 0.5.
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: The proportion of Tuberculosis in Shah Alam is 0.5, or not far from 0.5 (Ho: p = 0.5).
b) Alternate hypothesis:
• Ha: The proportion of Tuberculosis in Shah Alam is NOT 0.5 (Ha: p ≠ 0.5).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Steps in hypothesis testing
Step 3: Check assumptions.
• The outcomes can be categorized as binary data (Yes / No).
• The observations should be independent of each other.
• The sample size multiplied by the proportion in category A is more than 10 (np > 10), and the sample size multiplied by the proportion in category B is more than 10 (n(1 - p) > 10).
Step 4: Choose the test statistic.
• Binomial test (test of a single proportion).
Step 5: Find the p-value.
• Calculate z using the formula for a single proportion.

Hands-on: Explore [SPSS dialog screenshots, steps 1 to 5]
First step: find the proportion of Tuberculosis.

Output: Binomial Test
• The proportion of Tuberculosis is 0.43 (95% CI: 0.34, 0.53).
• Our hypothesis is 0.5.

Hands-on: Binomial Test (Method 1) [SPSS dialog screenshots, steps 1 to 7]

Output: Binomial Test
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p=0.207) is more than α (0.05), we DO NOT reject Ho.
• Therefore, we can conclude that the proportion of Tuberculosis in Shah Alam is 0.5 or not far from 0.5.
b) Problem conclusion:
• The proportion of Tuberculosis in Shah Alam is 0.5 (or not far from 0.5).

Hands-on: Binomial Test (Method 2) [SPSS dialog screenshots, steps 1 to 15]

Output: Binomial Test (Method 2)
• p=0.207.
• The statistical and problem conclusions are the same as for Method 1.
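As a cross-check on the binomial test above, the p-value can be approximated outside SPSS. The sketch below assumes 46 of the 106 respondents have Tuberculosis (the count behind the reported proportion of 0.43) and uses a z approximation with a continuity correction of 0.5. That this is the calculation SPSS performs internally is an assumption; the function names are hypothetical. An exact two-sided binomial p-value is included for comparison.

```python
from math import comb, erfc, sqrt

def binomial_z_test(successes, n, p0=0.5):
    """Two-sided z test for one proportion, with a 0.5 continuity correction."""
    expected = n * p0
    sd = sqrt(n * p0 * (1 - p0))
    z = max(abs(successes - expected) - 0.5, 0) / sd
    p_value = erfc(z / sqrt(2))          # two-sided tail area of the normal curve
    return z, p_value

def binomial_exact_p(successes, n, p0=0.5):
    """Exact two-sided p-value: total probability of outcomes no more
    likely than the observed one."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

# Assumed counts: 46 Tuberculosis cases among 106 respondents
z, p_z = binomial_z_test(46, 106, p0=0.5)
p_exact = binomial_exact_p(46, 106, p0=0.5)
print(f"z = {z:.3f}, approximate p = {p_z:.3f}")
print(f"exact p = {p_exact:.3f}")
```

Under these assumptions the approximate p-value rounds to 0.207, in line with the SPSS output, and leads to the same decision not to reject Ho.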
Output: Summary
Table: The proportion of Tuberculosis in Shah Alam (N=106) (Hypothesis: 0.5)
Variable | n | Proportion | Test statistic a | 95% CI | p-value
Tuberculosis | 106 | 0.43 | 60.00 | 0.34, 0.53 | 0.207
a Statistical test: Binomial test

5.1.2 Chi-square Goodness-of-fit
• It is also referred to as the one-sample chi-square.
• It explores the proportion of cases that fall into the various categories of a single categorical variable, and compares these with hypothesized values.
• Two values are involved:
  i. Observed number: the frequency of a category in the sample.
  ii. Expected number: calculated from the claimed distribution.

The properties of Goodness-of-fit
• The data are the observed frequencies.
• This means that there is only one data value for each category.
• The degrees of freedom is one less than the number of categories (df = k - 1).
• It has a chi-square distribution with a right-tailed test.
• The value of the test statistic does not change if the order of the categories is switched.

Chi-square Goodness-of-fit
Assumptions:
• Random sample.
• The observations must be independent and mutually exclusive.
• The data are counts (frequencies) of categorical data.
• Sufficiently large sample size: the expected count should be at least five (5) in each category (or not more than 20% of cells have an expected count less than 5).
• Data are categorical at nominal or ordinal levels.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.
Chi-square Goodness-of-fit
Research question:
• You want to determine whether the number of smokers in the sample corresponds to that reported in the literature from a nationwide study (20% smokers and 80% non-smokers).
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: The proportion of smokers and non-smokers in the sample fits the given distribution (i.e. 20% smokers; 80% non-smokers).
b) Alternative hypothesis:
• Ha: The sample has a different distribution.
Step 2: Choose the significance level.
• α = 0.05 (one-sided).
Step 3: Check assumptions.
• Random sample.
• The observations must be independent and mutually exclusive.
• The data are counts (frequencies) of categorical data.
• Sufficiently large sample size: the expected count should be at least five (5) in each category (or not more than 20% of cells have an expected count less than 5).
• Data are categorical at nominal or ordinal levels.
Step 4: Choose the test statistic.
• Chi-square test for goodness-of-fit with df = 1.
Step 5: Find the p-value.
• Calculate χ² using the formula: $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

Hands-on: Chi-square Goodness-of-fit (Method 1) [SPSS dialog screenshots, steps 1 to 10]

Output: Chi-square Goodness-of-fit
Descriptive statistics:
• The output shows the observed number, expected number and residual for each category; p<0.001.
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that the sample has a significantly different distribution.
b) Problem conclusion:
• The proportion of smokers and non-smokers in the sample is different from the given distribution (i.e. 20% smokers; 80% non-smokers).
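The goodness-of-fit formula in Step 5 can be worked through by hand or in a short script. The sketch below assumes observed counts of 46 smokers and 60 non-smokers (taken from the smoking-status cross-tabulation used in this lesson, N=106); the function name `chi_square_gof` is hypothetical. The p-value line relies on the fact that a chi-square variable with 1 df is a squared standard normal, so it is valid for df = 1 only.

```python
from math import erfc, sqrt

def chi_square_gof(observed, expected_props):
    """Return (chi-square statistic, expected counts) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

# Assumed observed counts: 46 smokers, 60 non-smokers (N = 106)
observed = [46, 60]
# Claimed distribution from the nationwide study: 20% smokers, 80% non-smokers
chi2, expected = chi_square_gof(observed, [0.20, 0.80])

# Right-tail p-value for df = 1 only
p_value = erfc(sqrt(chi2 / 2))

print("expected counts:", [round(e, 1) for e in expected])
print(f"chi-square = {chi2:.3f}, p < 0.001: {p_value < 0.001}")
```

Both expected counts (21.2 and 84.8) are above 5, so the sample-size assumption holds, and the statistic of about 36.3 agrees with the p<0.001 conclusion above.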
* If we explore the data for smoking status, the prevalence of smoking is 43.4% (95% CI: 33.8, 53.0).

Hands-on: Chi-square Goodness-of-fit (Method 2) [SPSS dialog screenshots, steps 1 to 16]

Output: Chi-square Goodness-of-fit (Method 2)
• p<0.001.
• The conclusions are the same as in the previous example.

5.2.1 Pearson Chi-square (Chi-square test for independence)
• It is a statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.
• It is used to determine the relationship or association between two categorical variables (independent and dependent).
• It compares the frequency of cases found in the various categories of one variable across the different categories of another variable.
Assumptions:
i. The data must be in the form of frequencies in both variables.
ii. The observations recorded are collected on a random basis.
iii. Independent observations: each person or case can only be counted once. They cannot appear in more than one category or group, and the data from one subject cannot influence the data from another.
iv. The lowest expected frequency in any cell should be at least 5 (or not more than 20% of cells have expected counts less than 5).
v. Both variables are categorical.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.

Pearson Chi-square
Research question:
• Is there any association between smoking and heart disease?
Step 1: Specify Ho and Ha.
a) Null hypothesis:
• There is no association between smoking and heart disease.
b) Alternate hypothesis:
• There is an association between smoking and heart disease.
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Step 3: Check assumptions.
• Random sample.
• The observations must be independent and mutually exclusive.
• The data are counts (frequencies) of categorical data.
• Sufficiently large sample size: the expected count should be at least five (5) in each cell (or not more than 20% of cells have an expected count less than 5).
• Data are categorical at nominal or ordinal levels.
Step 4: Choose the test statistic.
• Pearson chi-square test with (c - 1)(r - 1) df.
Step 5: Find the p-value.
• Calculate χ² using the formula: $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

Hands-on: Pearson Chi-square [SPSS dialog screenshots, steps 1 to 14]

Output: Pearson Chi-square
• Percentage of heart disease among smokers = 65.2%.
• Percentage of heart disease among non-smokers = 20.0%.
• Since 0 cells (0.0%) have an expected count less than 5, the Chi-square test for independence (Pearson Chi-square) is valid.
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a significant association between smoking and heart disease.
b) Problem conclusion:
• There is an association between smoking and heart disease. What is the association?
Explanation:
• To answer this question, we need to know the study design and the risk measurement.
  i. For a cross-sectional study and a case-control study, the risk measurement is the odds ratio (OR).
  ii. For a cohort study, the risk measurement is the Relative Risk (RR).

Output: Pearson Chi-square (OR and RR)
• Based on the odds ratio (OR): the smokers have 7.5 times the odds of heart disease compared to the non-smokers [OR: 7.5 (95% CI: 3.1, 18.0)].
• Based on the Relative Risk (RR): the smokers are 2.3 times more likely to develop heart disease compared to the non-smokers [RR: 2.3 (95% CI: 1.52, 3.49)].

Output: Summary
Table: The cross-tabulation between heart disease and smoking statuses (N=106)
Smoking status | Heart disease: Yes, Frequency (%) | Heart disease: No, Frequency (%) | Total, Frequency (%) | Chi-square (df) | p-value | OR (95% CI)
Yes | 30 (65.2) | 16 (34.8) | 46 (100.0) | 22.253 (1) | <0.001* | 7.50 (3.12, 18.0)
No | 12 (20.0) | 48 (80.0) | 60 (100.0) | | |
* Statistically significant at α=0.05

Table: The cross-tabulation between heart disease and smoking statuses (N=106)
Smoking status | Heart disease: Yes, Frequency (%) | Heart disease: No, Frequency (%) | Total, Frequency (%) | Chi-square (df) | p-value | RR (95% CI)
Yes | 30 (65.2) | 16 (34.8) | 46 (100.0) | 22.253 (1) | <0.001* | 2.30 (1.52, 3.49)
No | 12 (20.0) | 48 (80.0) | 60 (100.0) | | |
* Statistically significant at α=0.05

5.2.2 Chi-square Test-for-Homogeneity
• It is used to determine whether the distribution of a particular characteristic is similar for various groups (i.e. to see whether the populations are homogeneous).
• It is used with a single categorical variable from two (or more) independent samples.
Hypotheses:
• Ho: the proportions for the two (or more) distributions are the same.
• Ha: at least one pair of proportions of the distributions is different.

Chi-square Test-for-Homogeneity
Assumptions:
i. The data must be in the form of frequencies in both variables.
ii.
The observations recorded are collected on a random basis.
iii. Independent observations: each person or case can only be counted once. They cannot appear in more than one category or group, and the data from one subject cannot influence the data from another.
iv. The lowest expected frequency in any cell should be at least 5 (or not more than 20% of cells have expected counts less than 5).
v. Both variables are categorical.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.

Chi-square Test-for-Homogeneity
Research question:
• Is there any difference in the proportion of anxiety status (mild, moderate and severe) between those who have Tuberculosis and those who do not?
Step 1: Specify Ho and Ha.
a) Null hypothesis:
• Ho: There is no significant difference in the proportion of anxiety status (mild, moderate and severe) between those who have Tuberculosis and those who do not.
b) Alternative hypothesis:
• Ha: There is at least one significant difference in the proportion of anxiety status (mild, moderate and severe) between those who have Tuberculosis and those who do not.
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Step 3: Check assumptions.
i. The data must be in the form of frequencies in both variables.
ii. The observations recorded are collected on a random basis.
iii. Independent observations: each person or case can only be counted once. They cannot appear in more than one category or group, and the data from one subject cannot influence the data from another.
iv.
The lowest expected frequency in any cell should be at least 5 (or not more than 20% of cells have expected counts less than 5).
v. Both variables are categorical.
Step 4: Choose the test statistic.
• Pearson chi-square test with (c - 1)(r - 1) df.
Step 5: Find the p-value.
• Calculate χ² using the formula: $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

Hands-on: Test-for-Homogeneity [SPSS dialog screenshots, steps 1 to 13]

Output: Test-for-Homogeneity [SPSS output: descriptive statistics; p>0.05]

Conclusion
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p=0.053) is more than α (0.05), we DO NOT reject Ho.
• We can conclude that there is no significant difference in the proportion of anxiety status by Tuberculosis status.
b) Problem conclusion:
• There is no significant difference in the proportion of anxiety status (mild, moderate and severe) between those who have Tuberculosis and those who do not.

Output: Summary
Table: The cross-tabulation between Tuberculosis status and anxiety status (N=106)
Tuberculosis status | Anxiety: Mild, Freq., n (%) | Anxiety: Moderate, Freq., n (%) | Anxiety: Severe, Freq., n (%) | Total | Chi-square (df) a | p-value
Yes | 15 (31.9) | 12 (44.4) | 19 (59.4) | 46 (43.4) | 5.860 (2) | 0.053
No | 32 (68.1) | 15 (55.6) | 13 (40.6) | 60 (56.6) | |
a Statistical test: Chi-square Test-for-Homogeneity

5.2.3 Fisher’s Exact Test
• It is a statistical significance test used in the analysis of contingency tables.
• Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
• It is an analysis for independence in a 2 x 2 table when the assumptions for the chi-square test are not met.
Criteria:
• Both variables are dichotomous qualitative variables (2 x 2 table).
• One of the expected values in the 2 x 2 table is less than 5.
• The binary data are independent.
• Sample size of less than 20.
• Sample size of 20 to less than 50, but one or more of the cells have an expected value of less than 5.

Fisher’s Exact Test
Formula (exact probability of the observed 2 x 2 table with cells a, b, c, d):
$p = \frac{(a+b)!\,(a+c)!\,(b+d)!\,(c+d)!}{(a+b+c+d)!\;a!\,b!\,c!\,d!}$
Assumptions:
• The assumptions for Fisher's exact test are almost the same as for Pearson's chi-square test.
• However, it is used when:
  i. One of the expected values (note: not the observed value) in a 2 x 2 table is less than 5, and especially when it is less than 1.
  ii. The sample size is less than 50.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.

Fisher’s Exact Test
Research question:
• You want to determine whether there is an association between sun exposure and skin cancer.
• Among 17 people who were exposed to the sun, 12 have skin cancer and 5 do not.
• Among 10 people who were not exposed to the sun, 3 have skin cancer and 7 do not.
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• There is no association between sun exposure and skin cancer status.
b) Alternate hypothesis:
• There is an association between sun exposure and skin cancer status.
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Step 3: Check assumptions.
• The data must be in the form of frequencies in both variables.
• The observations recorded are collected on a random basis.
• Independent observations: each person or case can only be counted once.
They cannot appear in more than one category or group, and the data from one subject cannot influence the data from another.
• Sample size is less than 50.
• One of the expected values in the 2 x 2 table is less than 5.
• Both variables are qualitative variables.
Step 4: Choose the test statistic.
• Fisher's exact test with no df.
Step 5: Find the p-value.
• Calculate the exact probability using the formula:
$p = \frac{(a+b)!\,(a+c)!\,(b+d)!\,(c+d)!}{(a+b+c+d)!\;a!\,b!\,c!\,d!}$

Hands-on: Fisher’s Exact Test [SPSS dialog screenshots, steps 1 to 14]

Output: Fisher’s Exact Test
• The percentage of those exposed to the sun who developed skin cancer is 70.6%.
• The percentage of those not exposed to the sun who developed skin cancer is 30.0%.
• Since 1 cell (25.0%) has an expected count less than 5, the assumption for the Pearson chi-square test is NOT met.
• We can select Fisher’s exact test as an option to solve the problem.

Conclusion
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value is more than 0.05, we DO NOT reject Ho.
• Therefore, we can conclude that there is no significant association between sun exposure and skin cancer status (p=0.057).
b) Problem conclusion:
• There is no association between sun exposure and skin cancer status.

Output: Summary
Table: The cross-tabulation between sun exposure and skin cancer (N=27)
Sun exposure | Skin cancer: Yes, Frequency (%) | Skin cancer: No, Frequency (%) | Total, Frequency (%) | Chi-square (df) | p-value
Yes | 12 (70.6) | 5 (29.4) | 17 (100.0) | 4.201 (1) | 0.057 a
No | 3 (30.0) | 7 (70.0) | 10 (100.0) | |
a Statistical test: Fisher’s exact test

5.2.4 Yates’ Correction
• The hands-on for Yates’ Correction is the same as for Fisher’s Exact Test.
• However, we choose Yates’ correction when:
i. Sample size is more than 50.
ii.
At least 1 cell (25.0%) has an expected count less than 5.
• In the output, we choose Continuity Correction for the p-value.

Summary
No. of categories | Sample size ≥ 50 | At least 80% of cells have expected count ≥ 5 | Appropriate test / solution
R x C | Yes | Yes | Pearson Chi-square
2 x 2 | Yes | No | Yates’ Correction
2 x 2 | No | Yes / No | Fisher’s Exact test
R x C | Yes / No | No | Collapse the categories

5.3 McNemar’s test
• It is a non-parametric chi-square procedure to compare the proportions obtained from a 2 x 2 contingency table with a dichotomous trait and matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (i.e. whether there is “marginal homogeneity”).
• Mostly, it is used to compare before and after findings in the same individual with a dichotomous outcome.
• Example: a researcher wants to compare customer satisfaction (satisfied or not satisfied) before and after a campaign.
• It is also used in the analysis of cross-over designs and matched case-control studies.

2 x 2 table in McNemar’s test
• The test is often used where one tests for the presence (1) or absence (0) of something, and variable A is the state at the first observation (i.e. pre-test) and variable B is the state at the second observation (i.e. post-test).
Before campaign | After campaign: Satisfy | After campaign: Not satisfy | Total
Satisfy | a | b | a+b
Not satisfy | c | d | c+d
Total | a+c | b+d | a+b+c+d
• a & d: Concordant pairs.
• c & b: Discordant pairs: pairs with different outcomes, used to test the difference in outcome.
• df: (r - 1)(c - 1)

Hypotheses (using the same 2 x 2 table)
• The Ho of marginal homogeneity states that the two marginal probabilities for each outcome are the same,
• i.e. Pa + Pb = Pa + Pc; and Pc + Pd = Pb + Pd.
• Thus, the hypotheses are:
  - Ho: Pb = Pc
  - Ha: Pb ≠ Pc

McNemar’s test assumptions
i.
The cases are a random sample from the population.
ii. Related or dependent samples, with one categorical dependent variable with two categories (i.e. a dichotomous variable) and one categorical independent variable with two related groups.
iii. The two groups of your dependent variable must be mutually exclusive, meaning that no groups can overlap.

McNemar’s test statistics
• There are two types of McNemar test statistic:
i. Marginal Homogeneity test:
• Used when the number of discordant pairs is large (b + c > 25).
• χ² has a chi-squared distribution with 1 df.
• Formula: $\chi^2 = \frac{(b - c)^2}{b + c}$
• Odds ratio = c / b
ii. Exact binomial test (continuity correction):
• Used when b + c < 25.
• χ² is not well approximated by the chi-squared distribution.
• Formula: $\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$

Research Question
Research question:
• You want to determine whether a short course has an effect on the result of a test (pass or fail).
• The students' results before the short course are in the rows and their results after the short course are in the columns.
• The results are displayed below:
Pre-course | Post-course: Pass | Post-course: Fail | Total
Pass | 32 | 10 | 42
Fail | 40 | 24 | 64
Total | 72 | 34 | 106

Steps in hypothesis testing
• Step 1: Specify the null and alternate hypotheses.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Statistical and problem conclusions.
• Step 7: Interpret and give conclusion.

McNemar’s test
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: There is no difference in the examination result before and after the short course (Ho: Result_pre - Result_post = 0).
b) Alternative hypothesis:
• Ha: There is a difference in the examination result before and after the short course (Ha: Result_pre - Result_post ≠ 0).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Step 3: Check assumptions.
• The cases are a random sample from the population.
• Related or dependent samples, with one categorical dependent variable with two categories (i.e. a dichotomous variable) and one categorical independent variable with two related groups.
• The two groups of your dependent variable must be mutually exclusive.
Step 4: Choose the test statistic.
• McNemar’s (Marginal Homogeneity) test with (r - 1)(c - 1) df, since b + c > 25.
Step 5: Find the p-value.
• Calculate χ² (McNemar’s test) using the formula: $\chi^2 = \frac{(b - c)^2}{b + c}$

Hands-on: McNemar’s Test (Method 1) [SPSS dialog screenshots, steps 1 to 9]

Output: McNemar’s Test
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a difference in the examination result before and after the short course.

Hands-on: McNemar’s Test (Method 2) [SPSS dialog screenshots, steps 1 to 12]

Output: McNemar’s Test (Method 2)
• p<0.001; the conclusion is the same as for Method 1.

Hands-on: McNemar’s Test (Method 3) [SPSS dialog screenshots, steps 1 to 7]

Output: McNemar’s Test (Method 3)
• p<0.001; the conclusion is the same as for Method 1.
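The McNemar statistic depends only on the two discordant cells, so it is easy to verify by hand. The sketch below uses the discordant counts from the pre-course and post-course table (b = 10, c = 40) and computes the statistic with and without the continuity correction; the function name `mcnemar` is hypothetical, and the p-value uses the 1-df chi-square tail area.

```python
from math import erfc, sqrt

def mcnemar(b, c, continuity=True):
    """McNemar chi-square (1 df) from the two discordant cell counts."""
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)          # continuity correction: (|b - c| - 1)
    chi2 = diff ** 2 / (b + c)
    p_value = erfc(sqrt(chi2 / 2))       # right-tail area of chi-square, df = 1
    return chi2, p_value

# Discordant pairs from the short-course example:
# b = pass before / fail after, c = fail before / pass after
chi2_corr, p_corr = mcnemar(10, 40)
chi2_plain, p_plain = mcnemar(10, 40, continuity=False)

print(f"with continuity correction: chi-square = {chi2_corr:.3f}")
print(f"without correction:         chi-square = {chi2_plain:.3f}")
print("p < 0.001 in both cases:", p_corr < 0.001 and p_plain < 0.001)
```

The uncorrected formula gives 18.0 and the continuity-corrected formula gives 16.82; either way the p-value is below 0.001, so the decision to reject Ho does not depend on the correction.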
Output: Summary
Table: The cross-tabulation between examination result before and after short course (N=106)
Pre Course Result | Post Course Result: Pass | Post Course Result: Fail | Total | Chi-square (df) | p-value
Pass | 32 | 10 | 42 | 16.820 (1) | <0.001 a
Fail | 40 | 24 | 64 | |
a Statistical test: McNemar’s test

5.4 Mantel-Haenszel test
• We are often interested only in investigating the relationship between two binary variables, for example a disease and an exposure. However, we have to control for confounders.
• A confounding variable is a variable that may be associated with either the disease or the exposure, or both.
• The Mantel-Haenszel (MH) test is used in stratified analysis of categorical variables when there is a confounder that needs to be controlled.
• It allows an investigator to test the association between a binary predictor or treatment and a binary outcome while taking the stratification into account.
• This is another way to test for conditional independence, by exploring associations in partial tables for 2 × 2 × K tables.

Mantel-Haenszel
• Assumptions:
i. Observations are independent of each other: each observation comes from a different subject, and the subjects were randomly selected from the population of interest.
ii. All observations are identically distributed: the samples were obtained in the same way.

Mantel-Haenszel Test
• The Mantel-Haenszel test is based on a z statistic in which the summation (Σ) is across levels of the confounder.
• When the test is statistically significant, the association between the disease and the exposure is considered real.
• Because we assume that the confounder is not an effect modifier, the odds ratio is constant across its levels.
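Because the odds ratio is assumed constant across levels of the confounder, a single pooled estimate can be computed from the stratum tables. The sketch below applies the standard Mantel-Haenszel pooled odds ratio, the sum of a·d/n over strata divided by the sum of b·c/n, to the counts of this lesson's headache example (medication type by response, stratified by gender). The function name and the dictionary layout are hypothetical; confidence intervals and the test statistic are left to SPSS.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across 2 x 2 strata.

    Each stratum is (a, b, c, d):
      a = exposed with outcome,    b = exposed without outcome,
      c = unexposed with outcome,  d = unexposed without outcome.
    """
    strata = list(strata)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Counts from the headache example: (New better, New same, Placebo better, Placebo same)
strata = {
    "male": (12, 16, 7, 19),
    "female": (16, 11, 5, 20),
}

# Stratum-specific odds ratios, ad / bc
stratum_or = {sex: (a * d) / (b * c) for sex, (a, b, c, d) in strata.items()}
pooled_or = mantel_haenszel_or(strata.values())

for sex, value in stratum_or.items():
    print(f"{sex}: OR = {value:.2f}")
print(f"Mantel-Haenszel pooled OR = {pooled_or:.2f}")
```

With these counts the female stratum gives an odds ratio of about 5.82 and the pooled estimate is about 3.31, the same values reported in the SPSS output for this example.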
OR for Mantel-Haenszel
• The OR at each level is estimated by ad/bc.
• The Mantel-Haenszel procedure pools data across the levels of the confounder to obtain a combined estimate:
$OR_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$, where $n_i = a_i + b_i + c_i + d_i$ is the number of subjects in level i.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no sig. difference.
  - p < α → reject Ho → sig. difference.
• Step 6: Conclusions (statistical & problem).
• Step 7: Interpret and give overall conclusion.

Mantel-Haenszel test
Research question:
• A researcher wants to determine the impact of a new medication on the treatment of headache (adjusted by gender).
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• Ho: There is no association between medication status and outcome of headache (adjusted by gender).
b) Alternate hypothesis:
• Ha: There is an association between medication status and outcome of headache (adjusted by gender).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).
Step 3: Check assumptions.
• Observations are independent of each other: each observation comes from a different subject, and the subjects were randomly selected from the population of interest.
• All observations are identically distributed: the samples were obtained in the same way.
Step 4: Choose the test statistic.
• Mantel-Haenszel test with (r - 1)(c - 1) df.
Step 5: Find the p-value.
• Calculate z.

Hands-on: Mantel-Haenszel test [SPSS dialog screenshots, steps 1 to 17]

Output: Mantel-Haenszel test [SPSS output: descriptive statistics]

Output: Mantel-Haenszel test
Step 6: Conclusion
a) Statistical conclusion:
• Since the p-value (p=0.004) is less than α (0.05), we reject Ho and accept Ha.
b) Problem conclusion:
• Therefore, we can conclude that there is a significant association between type of medication and headache status after adjusting for gender (p=0.004).

Output: Mantel-Haenszel test
• However, only female patients showed an association between type of medication and headache status (p=0.004).
• There is no significant association for male patients (p=0.221).

Output: Mantel-Haenszel test
Association:
• After stratifying by gender, an association between type of medication and headache status was found only among female patients (p=0.004).
• Female patients on the new medication had 5.82 times the odds (95% CI: 1.68, 20.20) of a better outcome compared to those treated with placebo.

Output: Mantel-Haenszel test
• Homogeneous association implies that the conditional relationship between any pair of variables, given the third variable, is the same across the strata in the population.
• The homogeneity of the odds ratio is tested using the Breslow-Day and Tarone's tests.
• The output shows that both tests are not statistically significant (p>0.05).
• Therefore, we can conclude that the odds ratio is homogeneous across the strata.

Output: Mantel-Haenszel test
• Conditional independence tests whether the odds ratios are the same and equal to 1 across the strata in the population.
• Conditional independence is tested using Cochran's and Mantel-Haenszel tests.
• The output shows that both tests are statistically significant (p<0.05).
• Therefore, we can conclude that the odds ratios are not equal to 1 across the strata, i.e. conditional independence is rejected.

Output: Mantel-Haenszel test
Interpretation:
• There is a statistically significant association between type of medication and headache status after adjusting for gender using the MH test [estimate: 3.31 (95%CI: 1.45, 7.59), p=0.005].

Output: Summary
Table: The cross-tabulation between medication type and response to treatment, adjusting for gender (N=106)

Gender   Medication type   Better, n (%)   Same, n (%)   Total, n (%)
Male     New               12 (42.9)       16 (57.1)     28 (100.0)
Male     Placebo            7 (26.9)       19 (73.1)     26 (100.0)
Female   New               16 (59.3)       11 (40.7)     27 (100.0)
Female   Placebo            5 (20.0)       20 (80.0)     25 (100.0)

Chi-square (df): 8.443 (1); p-value a: 0.005*; MH OR (95%CI): 3.31 (1.45, 7.59)
* Statistically significant at α=0.05
a Mantel-Haenszel test

Basic Biostatistics Workshop Using SPSS 2019
Lesson 6 Correlation Analyses
Re-cap: Exploring relationship among variables
(Purpose | Statistical techniques | Remarks)
• Relationship only (2 categorical variables) | χ² test of independence or homogeneity | Has been covered in categorical analyses.
• Relationship with strength and direction (quantitative) | Correlation: Pearson's, Spearman rank, Kendall's tau, Phi, Cramer's V, etc. | Covered in this lesson and in the non-parametric methods lesson.
• Prediction between a dependent variable and independent variable(s) | Regression (simple, multiple, logistic, etc.) | Will be covered in Regression Analyses.
• Identify the structure underlying a group of related variables | Reliability test and Factor Analysis.

Simple Correlation

6.1 Simple Correlation
▪ Correlation is defined as the quantification of the degree to which two continuous variables are related, provided the relationship is linear.
▪ It measures the strength of the linear relationship between two variables, without taking into consideration the fact that both these variables may be influenced by a third variable.
Correlation coefficient, r:
• It measures the direction and the strength of association in correlation.
• The value ranges from -1 to +1:
  i. +1 : perfect positive correlation.
  ii. 0 : no correlation at all.
  iii. -1 : perfect negative correlation.

Magnitude of association, r
• The value of r and the corresponding qualitative description of the strength of the linear relationship:
Value of r        Qualitative description of the strength *
±1                Perfect correlation
±0.75 to ±1       Strong (positive / negative) correlation
±0.50 to ±0.75    Moderate (positive / negative) correlation
±0.25 to ±0.50    Weak (positive / negative) correlation
0 to ±0.25        No linear correlation
* Note: different books give different classifications.

Different types of correlation analyses
Classification of correlation analyses:
• On the basis of degree of correlation: positive correlation, negative correlation.
• On the basis of number of variables: simple correlation, partial correlation, multiple correlation.
• On the basis of linearity: linear correlation, non-linear correlation.

Assumptions
i. The samples must be paired.
ii. The sample must be randomly selected.
iii. The observations must be independent.
iv. The scale of measurement should be interval or ratio.
v. Both variables (x1 and x2) must be normally distributed.
vi. The relationship between the variables in the analysis is assumed to be a straight line (linearity).
vii. The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Calculate r.
• Step 6: Find the p-value, compare with α:
  - p > α → do not reject Ho → no significant correlation.
  - p < α → reject Ho → significant correlation.
• Step 7: Conclusions (statistical and problem).
• Step 8: Interpret and give overall conclusion.

Research Question
Research question:
• What is the correlation between cholesterol and calorie intake?
Step 1: Specify the Ho and Ha.
a) Null hypothesis:
• The linear correlation between cholesterol and calorie intake is 0 (r = 0).
b) Alternate hypothesis:
• The linear correlation between cholesterol and calorie intake is not 0 (r ≠ 0).
Step 2: Choose the significance level.
• α = 0.05 (two-sided).

Simple Correlation
Step 3: Checking assumptions.
• The samples must be paired.
• The sample must be randomly selected.
• The observations must be independent.
• The scale of measurement should be interval or ratio.
• Both variables (x1 and x2) must be normally distributed.
• The relationship between the variables in the analysis is assumed to be a straight line (linearity).
• The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).
Step 4: Choose the test statistic
• Pearson correlation with (n - 2) df.

Simple Correlation
Step 5: Calculate r
• Formula:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
Step 6: Find the p-value
• Calculate t-calc and compare it with the t distribution on (n - 2) df.
• Formula:
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Hands-on: distribution of cholesterol & calorie intake
[SPSS dialog screenshots, steps 1 to 9]

Output: distribution of cholesterol & calorie intake
Skewness:
• Cholesterol: -0.168
• Calorie intake: 0.151
• Both are within ±1.
• It can be concluded that both variables are normally distributed.
KS test:
• The test is not statistically significant for either variable (p>0.05).
• It can be concluded that both variables are normally distributed.

Output: distribution of cholesterol & calorie intake
[Figure: distribution plots for cholesterol and calorie intake]

Hands-on: Scatter plot
[SPSS dialog screenshots, steps 1 to 8]

Output: Scatter plot
• From the scatter plot, we have a rough idea that there is a positive linear correlation between cholesterol and calorie intake.

Hands-on: Simple Correlation
[SPSS dialog screenshots, steps 1 to 6]

Output: Simple Correlation
r = 0.940, p<0.001
Step 7: Conclusion
a) Statistical conclusion
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there was a significant linear correlation between cholesterol and calorie intake (p<0.001).
• The linear correlation between cholesterol and calorie intake was not 0.

Output: Simple Correlation
r = 0.940, p<0.001
Step 7: Conclusion
b) Problem conclusion:
▪ There is a strong positive linear correlation between cholesterol and calorie intake (r=0.940).

Summary: Simple Correlation
Table: The correlation between calorie intake and cholesterol (N=106)

Variables        Mean (SD)         r a     p-value
Cholesterol      10.23 (2.85)      0.940   <0.001*
Calorie intake   768.31 (373.97)
* Statistically significant at α=0.05
a Statistical test: Pearson correlation

Partial Correlation

6.2 Partial Correlation
• The partial correlation is the correlation between two continuous variables, adjusted for a third variable.
• When many factors influence the outcome, it is possible to control for some variables so that the effect of each variable can be studied separately.
[Diagram: factor 1 → outcome, with other factors (confounding) acting on both]

Partial correlation analysis
• Consider a correlation matrix for variables A, B and C:

      A    B        C
A     *    r(AB)    r(AC)
B     -    *        r(BC)
C     -    -        *

▪ The partial correlation of A and B controlling (adjusted) for C is written r(AB·C).

Partial Correlation
Assumptions:
i. The samples must be paired.
ii. The sample must be randomly selected.
iii. The observations must be independent.
iv. The scale of measurement should be interval or ratio.
v. All variables (x1, x2 and x3) must be normally distributed.
vi. The relationship between the variables in the analysis is assumed to be a straight line (linearity).
vii. The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Calculate r.
• Step 6: Find the p-value, compare with α:
  - p > α → do not reject Ho → no significant correlation.
  - p < α → reject Ho → significant correlation.
• Step 7: Conclusions (statistical and problem).
• Step 8: Interpret and give overall conclusion.

Research Question
Research question:
• What is the correlation between cholesterol and calorie intake (adjusted by age)?
Step 1: Specify the Ho and Ha
a) Null hypothesis:
• The linear correlation between cholesterol and calorie intake (adjusted by age) is 0 (r = 0).
b) Alternate hypothesis:
• The linear correlation between cholesterol and calorie intake (adjusted by age) is not 0 (r ≠ 0).
Step 2: Choose the significance level
• α = 0.05 (two-sided)

Partial Correlation
Step 3: Checking assumptions
• The samples must be paired.
• The sample must be randomly selected.
• The observations must be independent.
• The scale of measurement should be interval or ratio.
• All variables (x1, x2 and x3) must be normally distributed.
• The relationship between the variables in the analysis is assumed to be a straight line (linearity).
• The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).
Step 4: Choose the test statistic
• Partial correlation with (n - 3) df, because one control variable is used.

Partial Correlation
Step 5: Calculate r
• Formula (A and B are the two variables of interest, C is the control variable):
$$r_{AB \cdot C} = \frac{r_{AB} - r_{AC}\, r_{BC}}{\sqrt{(1 - r_{AC}^2)(1 - r_{BC}^2)}}$$
Step 6: Find the p-value
• Calculate t-calc and compare it with the t distribution on (n - 3) df.
• Formula:
$$t = \frac{r_{AB \cdot C}\sqrt{n - 3}}{\sqrt{1 - r_{AB \cdot C}^2}}$$

Hands-on: Partial Correlation
[SPSS dialog screenshots, steps 1 to 6]

Output: Partial Correlation
r = 0.939, p<0.001
Step 7: Conclusion
a) Statistical conclusion
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a significant correlation between cholesterol and calorie intake adjusted by age (p<0.001).
• The linear correlation between cholesterol and calorie intake (adjusted by age) is not 0 (r ≠ 0).
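As a rough illustration of the calculation behind Steps 5 and 6, the sketch below computes the pairwise Pearson coefficients and then the first-order partial correlation from them. It is not the SPSS implementation; the function names and the six data rows are hypothetical and serve only to make the example runnable. With one control variable the t statistic uses n - 3 degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(r_ab, r_ac, r_bc):
    """Correlation of A and B adjusted for a single control variable C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def t_from_r(r, n, n_controls=0):
    """t statistic and degrees of freedom for testing r = 0."""
    df = n - 2 - n_controls
    return r * math.sqrt(df / (1 - r ** 2)), df


# Hypothetical illustration data (NOT the workshop dataset)
cholesterol = [8.1, 9.4, 10.2, 11.0, 12.3, 13.1]
calories = [520, 640, 760, 830, 980, 1090]
age = [34, 41, 38, 52, 47, 60]

r_ab = pearson_r(cholesterol, calories)
r_ac = pearson_r(cholesterol, age)
r_bc = pearson_r(calories, age)
r_adj = partial_r(r_ab, r_ac, r_bc)
t_value, df = t_from_r(r_adj, n=len(age), n_controls=1)
print("unadjusted r:", round(r_ab, 3))
print("r adjusted for age:", round(r_adj, 3), "t:", round(t_value, 2), "df:", df)
```

If the control variable is uncorrelated with both variables of interest, the adjusted coefficient equals the unadjusted one, which is a quick sanity check on the formula.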
Output: Partial Correlation
r = 0.939, p<0.001
Step 7: Conclusion
b) Problem conclusion
• There is a strong positive linear partial correlation between cholesterol and calorie intake adjusted by age (r=0.939).

Summary
Table: The partial correlation between calorie intake and cholesterol (adjusted by age) (N=103)

Control   Variable         Mean (SD)         r a     p-value
Age       Cholesterol      10.23 (2.85)      0.939   <0.001*
          Calorie intake   768.31 (373.97)
* Statistically significant at α=0.05
a Statistical test: Partial correlation

Re-cap: Some of the correlation coefficients

Name              First variable             Second variable
Pearson, r        Interval / Ratio           Interval / Ratio
Spearman rho, ρ   Ordinal                    Ordinal
Kendall's Tau     Ordinal                    Ordinal
Phi               Dichotomous (Nominal)      Dichotomous (Nominal)
Intraclass, R     Interval / Ratio (Test)    Interval / Ratio (Re-test)

Basic Biostatistics Workshop Using SPSS 2019
Lesson 7 Simple Linear Regression

Simple Linear Regression
• Regression analysis is a statistical methodology to estimate the relationship (using a theoretical or an empirical model) of a response variable to a set of predictor variables.
• It helps to understand how the typical value of one continuous dependent variable (usually called y) changes when an independent variable (continuous or categorical, usually called x) is varied.
• The dependent variable is the variable for which we want to make a prediction.
• Unlike correlation analysis, in regression analysis we have to identify the dependent and independent variables.

Simple Linear Regression
• General regression model: Y = β0 + β1X + ε
Where:
• β0 and β1 are parameters: β0 is the intercept and β1 is the regression coefficient.
• X is a known constant.
• The deviations ε are independent N(0, σ²).
Meaning:
• The values of the regression parameters β0 and β1 are not known; we estimate them from data.
• β1 indicates the change in the mean response per unit increase in X.

Regression Line
• If the scatter plot of our sample data suggests a linear relationship between two variables, i.e. y = β0 + β1x,
• we can summarize the relationship by drawing a straight line on the plot.
• The least squares method gives us the “best” estimated line for our set of sample data.
• We will write an estimated regression line based on sample data as:
$$\hat{y} = b_0 + b_1 x$$

Regression Line
• The method of least squares chooses the values for b0 and b1 to minimize the sum of squared errors.
• Using calculus, we obtain the estimating formulas:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
• or, equivalently,
$$b_1 = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$$

Beta Coefficient, b
• It estimates the slope (β) of the regression model in the population.
• It reflects the amount of change in the dependent variable per one unit increase in the independent variable.
• There is no fixed range of values for the coefficient.
• It is calculated as the slope of the straight line that fits the data in the scatter plot.
• The method of fitting the line is called the “least squares” method.

Simple Linear Regression
[Figure: scatter plot against cholesterol level (x-axis) with the fitted regression line; the slope b (population: β) is the regression coefficient; Rsq = 0.9238]
• e.g. “b = 15” means that mean blood pressure will increase by 15 points when cholesterol level increases by 1 unit.

Coefficient of Determination, r2
• It can be interpreted as the proportion of the variability among the observed values of y that is explained by the linear regression of Y on X.
• The r2 is the ratio of the explained variation to the total variation.
• It is equal to the square of the Pearson correlation coefficient.
• Range of values: 0 ≤ r2 ≤ 1; it denotes the strength of the linear association between x and y.
• The closer r2 is to 1, the closer the data lie to the line of best fit.

Coefficient of Determination, r2
• Example: The linear relationship between blood pressure and cholesterol level: r = 0.829, then r2 = 0.687.
• This means that 68.7% of the total variation in blood pressure (y) can be explained by the linear relationship with cholesterol level (x) (as described by the regression equation).
• The other 31.3% of the total variation in blood pressure (y) remains unexplained.

Assumptions
• The samples must be paired.
• The sample must be randomly selected.
• The observations must be independent.
• The scale of measurement must be interval or ratio.
• Both variables (x and y) must be normally distributed.
• The relationship between the variables in the analysis is assumed to be a straight line.
• The means of the subpopulations of Y all lie on the same straight line (assumption of linearity).
• For each value of x, there is a subpopulation of Y, which must be normally distributed.
• The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the regression line.
• Step 6: Find the p-value, compare with α:
  - p > α → do not reject Ho → no significant relationship.
  - p < α → reject Ho → significant relationship.
• Step 7: Conclusions (statistical and problem).
• Step 8: Interpret and give overall conclusion.

Simple Linear Regression
Research question - 3 scenarios:
i. What is the relationship between cholesterol and calorie intake (continuous independent variable)?
ii. What is the relationship between cholesterol and gender (male and female) (binary independent variable)?
iii. What is the relationship between cholesterol and race (Malay, Chinese and Indian) (categorical independent variable with more than 2 categories)?

Research Question (1)
Research question:
• What is the relationship between cholesterol and calorie intake?
Step 1: Specify the Ho and Ha
a) Null hypothesis:
• The relationship between cholesterol and calorie intake is 0 (β = 0).
b) Alternate hypothesis:
• The relationship between cholesterol and calorie intake is not 0 (β ≠ 0).
Step 2: Choose the significance level
• α = 0.05 (two-sided)

Assumptions
Step 3: Checking assumptions
• The samples must be paired.
• The sample must be randomly selected.
• The observations must be independent.
• The scale of measurement must be interval or ratio.
• Both variables (x and y) must be normally distributed.
• The relationship between the variables in the analysis is assumed to be a straight line.
• The means of the subpopulations of Y all lie on the same straight line (assumption of linearity).
• For each value of x, there is a subpopulation of Y, which must be normally distributed.
• The data are assumed to be normally distributed about the regression line with constant variance (homoscedasticity).

Step 4: Choose test statistic
• Simple linear regression analysis with (n - 2) df
  i. Calorie intake (continuous variable) - dependent
  ii. Cholesterol level (continuous variable) - independent
Step 5: Find the regression line

Step 6: Find the p-value
• Calculate t-calc.
• Formula: t = b1 / SE(b1)
• SE(b1) denotes the standard error of b1.

Hands-on: SLR
[SPSS dialog screenshots, steps 1 to 17]

Output: SLR
[SPSS output tables]

Output: SLR
i. r = 0.940
• The linear correlation between calorie intake and cholesterol is positive and the strength is strong.
ii. r2 = 0.884
• 88.4% of the variation in calorie intake is explained by cholesterol. The other 11.6% is explained by other factors.
Are the correlation and coefficient of determination significant?

Output: SLR
Step 7: Conclusion
a) Statistical conclusion
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• We can conclude that the relationship between cholesterol and calorie intake is not 0 (β ≠ 0).
• The correlation (r = 0.940) and coefficient of determination (r2 = 0.884) are not due to chance.

Output: SLR
Step 7: Conclusion
b) Problem conclusion
• The linear relationship between calorie intake and cholesterol level is not 0.
• The equation relating calorie intake and cholesterol is:
  Calorie intake = -494.30 + 123.488*cholesterol

Output: SLR
Step 7: Conclusion
b) Problem conclusion
  Calorie intake = -494.30 + 123.488*cholesterol
• This means that if cholesterol level increases by 1 mmol/l, we would expect calorie intake to increase by 123.49 (95%CI: 114.79, 132.18) units.
• Cholesterol level explains 88.4% of the variation in calorie intake.
• The other 11.6% is explained by other factors.

Output: SLR
[SPSS output: scatter plot for checking the assumption]

Output: SLR
• The scatter plot does not show any pattern.
• It can be concluded that the assumption is met.
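The least-squares fit and the test statistic t = b1 / SE(b1) used in this example can be sketched as follows. This is a minimal illustration and not the SPSS routine: the function name and the five data points are hypothetical, and the standard error uses the usual residual variance on n - 2 degrees of freedom.

```python
import math


def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1*x with r-squared and t for b1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b1 = sxy / sxx                      # slope
    b0 = mean_y - b1 * mean_x           # intercept
    sse = syy - b1 * sxy                # residual (error) sum of squares
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return {
        "b0": b0,
        "b1": b1,
        "r2": sxy ** 2 / (sxx * syy),   # explained / total variation
        "se_b1": se_b1,
        "t": b1 / se_b1,
        "df": n - 2,
    }


# Hypothetical illustration data (NOT the workshop dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = simple_linear_regression(x, y)
print({name: round(value, 4) for name, value in fit.items()})
```

The p-value is then read from the t distribution with n - 2 degrees of freedom, which SPSS reports directly in the coefficients table.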
Summary
Table: The relationship between cholesterol level and calorie intake (N=106)

Variable      b (95%CI)                  t a      p-value
Cholesterol   123.49 (114.79, 132.18)    28.166   <0.001*
* Statistically significant at α=0.05
a Statistical test: Simple Linear Regression

Research Question (2)
Research question:
• What is the relationship between cholesterol and gender (male and female)?
• In this example we can see that:
  i. The dependent variable (cholesterol) is a continuous variable.
  ii. The independent variable (gender: male and female) is a categorical variable.
• In the previous lesson, we compared whether the mean cholesterol differs between males and females.
• In that scenario, we used the independent t-test to solve the problem.
• However, we can also use simple linear regression to analyse the research question.

Hands-on: SLR
[SPSS dialog screenshots, steps 1 to 17]

Output: SLR
[SPSS output tables]

Output: SLR
i. r = 0.261
• Since gender is a categorical variable, we cannot conclude that there is a weak linear correlation between cholesterol and gender.
ii. r2 = 0.068
• We can conclude that 6.8% of the variation in cholesterol is explained by gender. The other 93.2% is explained by other factors.
Are the correlation and coefficient of determination significant?

Output: SLR
Step 7: Conclusion
a) Statistical conclusion
• Since the p-value (p=0.007) is less than α (0.05), we reject Ho and accept Ha.
• We can conclude that the relationship between cholesterol and gender is not 0 (β ≠ 0).
• The coefficient of determination (r2 = 0.068) is not due to chance.

Output: SLR
Step 7: Conclusion
b) Problem conclusion
• The relationship between gender and cholesterol level is not 0.
• The equation relating gender and cholesterol is:
  Cholesterol level = 9.50 + 1.477*gender

Output: SLR
Step 7: Conclusion
b) Problem conclusion
  Cholesterol level = 9.50 + 1.477*gender
• Coding: 0 = male; 1 = female
• This means that if the respondent is male (0), the predicted mean cholesterol level is 9.50.
• However, if the respondent is female (1), the predicted mean cholesterol level is 9.50 + 1.477(1) = 10.977.

Output: SLR
Step 7: Conclusion
b) Problem conclusion
  Cholesterol level = 9.50 + 1.477*gender
• This means that if the respondent is female, the mean cholesterol level is higher by 1.477 (95%CI: 0.41, 2.54) mmol/l.
• Gender explains 6.8% of the variation in cholesterol.
• The other 93.2% is explained by other factors.

Output: SLR
[SPSS output: scatter plot for checking the assumption]

Output: SLR
• The scatter plot does not show any pattern.
• It can be concluded that the assumption is met.

Summary
Table: The relationship between cholesterol level and gender (N=106)

Variable                  b (95%CI)            t a     p-value
Gender (female vs male)   1.477 (0.41, 2.54)   2.752   0.007*
* Statistically significant at α=0.05
a Statistical test: Simple Linear Regression

Research Question (3)
Research question:
• What is the relationship between cholesterol and race (Malay, Chinese and Indian)?
• In this example we can see that:
  i. The dependent variable (cholesterol) is a continuous variable.
  ii. The independent variable (race: Malay, Chinese and Indian) is a categorical variable with more than two categories.
• In the previous lesson, we compared whether the mean cholesterol differs between the races (Malay, Chinese and Indian).
• In that scenario, we used ANOVA to solve the problem.
• However, we can also use simple linear regression to analyse the research question.
• But we must create dummy coding.
Research Question (3)
From dataset: SLR
• Coding: 1 = Malay; 2 = Chinese; and 3 = Indian
• Therefore, we need to transform the code into (n - 1) dummy variables (vectors), where n is the number of categories:

Race          X1   X2
Malay (1)     0    0
Chinese (2)   1    0
Indian (3)    0    1

Hands-on: Creating Dummy Coding
[SPSS dialog screenshots, steps 1 to 13]
Then, click OK.

Output: Creating Dummy Coding
• Then do the same thing for the second vector (X2).
• Dummy coding for X2: Malay = 0; Chinese = 0; and Indian = 1

Output: Creating Dummy Coding
[SPSS data view showing the new dummy variables]

Hands-on: SLR
[SPSS dialog screenshots, steps 1 to 17]

Output: SLR
[SPSS output tables]

Output: SLR
i. r = 0.276
• Since race is a categorical variable, we cannot conclude that there is a weak linear correlation between cholesterol and race.
ii. r2 = 0.076
• We can conclude that 7.6% of the variation in cholesterol is explained by race. The other 92.4% is explained by other factors.
Are the correlation and coefficient of determination significant?

Output: SLR
Step 7: Conclusion
a) Statistical conclusion
• Since the p-value (p=0.017) is less than α (0.05), we reject Ho and accept Ha.
• We can conclude that the relationship between cholesterol and race is not 0 (β ≠ 0).
• The coefficient of determination (r2 = 0.076) is not due to chance.

Output: SLR
Step 7: Conclusion
a) Statistical conclusion
• At the beginning, we decided that Malay was chosen as the reference group.
• Therefore, we can conclude that:
  i. There is a significant difference in the mean cholesterol level between Indian and Malay respondents (p=0.006).
  ii. However, there is no significant difference in the mean cholesterol level between Chinese and Malay respondents (p=0.448).
• With this coding, we cannot compare the difference between Chinese and Indian respondents.
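The dummy coding used above can be sketched outside SPSS as follows. This is an illustrative helper only (the function name `dummy_code` and the `is_` prefix are made up for this example): each non-reference category gets its own 0/1 column, so three race categories yield two dummy variables with Malay as the reference.

```python
def dummy_code(values, categories, reference):
    """Return one dict of 0/1 indicators per observation.

    Every category except the reference gets its own indicator column;
    the reference category is the row where all indicators are 0.
    """
    levels = [c for c in categories if c != reference]
    return [{f"is_{level}": int(value == level) for level in levels}
            for value in values]


race_labels = {1: "Malay", 2: "Chinese", 3: "Indian"}   # coding from the dataset
observed_codes = [1, 2, 3, 2, 1]                        # hypothetical respondents
observed = [race_labels[code] for code in observed_codes]

rows = dummy_code(observed, list(race_labels.values()), reference="Malay")
for label, row in zip(observed, rows):
    print(label, row)
```

Choosing a different reference, for example `reference="Chinese"`, produces the columns needed to compare Indian with Chinese respondents, which the Malay-referenced model does not provide directly.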
Output: SLR
Step 7: Conclusion
b) Problem conclusion
• When the respondent is Indian, the mean cholesterol level is 2.02 (95%CI: 0.59, 3.44) mmol/l higher than for Malay respondents.
• However, when the respondent is Chinese, there is no significant difference in the cholesterol level compared to Malay respondents.
• We can conclude that 7.6% of the variation in cholesterol is explained by race. The other 92.4% is explained by other factors.

Output: SLR
[SPSS output: scatter plot for checking the assumption]

Output: SLR
• The scatter plot does not show any pattern.
• It can be concluded that the assumption is met.

Basic Biostatistics Workshop Using SPSS 2019
Lesson 8 Simple Logistic Regression

Introduction
• Logistic regression is a form of regression analysis in which the outcome (dependent) variable is categorical.
• What is the “Logistic” component?
  - Instead of modelling the outcome, Y, directly, the method models the log odds of Y using the logistic function.
• What is the “Regression” component?
  - It is a method used to quantify the association between an outcome and predictor variables.
  - It can be used to build predictive models as a function of the predictors.

Introduction
• Types of logistic regression:
  i. Binary logistic regression: when the dependent variable is a dichotomy.
     - Example: Disease: Yes / No.
  ii. Multinomial logistic regression: when there are more than two levels in the dependent variable.
     - Example: Severity of disease: mild / moderate / severe.
• The predictors (independent variables) can be continuous and/or categorical variables.
• Simple logistic regression has only one independent variable, whereas multiple logistic regression has more than one independent variable.

Logistic Function
• The logistic function for a single predictor: Z = β0 + β1X1 + ε
• The Z value is then transformed using a link function to obtain the probability of the event occurring.
• For a binary outcome, the link function is:
$$P = \frac{1}{1 + e^{-Z}}$$
• The plot: [Figure: S-shaped logistic curve of P against Z]

Dichotomous Predictor
• Therefore, for the odds ratio associated with risk presence we have:
$$OR = e^{\beta_1}$$
• Taking the natural logarithm we have:
$$\ln(OR) = \beta_1$$
• Thus the estimated regression coefficient associated with a 0 - 1 coded dichotomous predictor is the natural log of the OR associated with risk presence.
• In this practical, there are three types of simple logistic regression analyses:
  i. Continuous independent variable (e.g. radiation time).
  ii. Binary independent variable (e.g. smoking status: Yes / No).
  iii. Independent variable with more than two categories (e.g. diabetes mellitus status: normal, impaired and DM).

Assumptions
• Logistic regression does not assume a linear relationship between the dependent and independent variables.
• The dependent variable must be a dichotomy (two categories).
• The independent variables need not be interval, nor normally distributed, nor linearly related, nor of equal variance within each group.
• The categories (groups) must be mutually exclusive and exhaustive.
• Larger samples are needed than for linear regression because maximum likelihood coefficients are large-sample estimates. A minimum of 50 cases per predictor is recommended.

Steps in Hypothesis testing
• Step 1: Specify the null hypothesis and alternate hypothesis.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Find the p-value, compare with α:
  - p > α → do not reject Ho → no significant association.
  - p < α → reject Ho → significant association.
• Step 6: Conclusions (statistical and problem).
• Step 7: Interpret and give overall conclusion.

Research Question (1)
Research question:
• Is radiation time associated with lung cancer?
• Independent variable: radiation time (continuous variable)
• Dependent variable: lung cancer (binary: Yes / No)
Step 1: Specify Ho and Ha
a) Null hypothesis:
• Ho: There is no association between radiation time and lung cancer.
b) Alternate hypothesis:
• Ha: There is an association between radiation time and lung cancer.

Research Question (1)
Step 2: Choose the significance level
• α = 0.05 (one-sided)
Step 3: Checking assumptions
• Logistic regression does not assume a linear relationship between the dependent and independent variables.
• The dependent variable must be a dichotomy (two categories).
• The independent variables need not be interval, nor normally distributed, nor linearly related, nor of equal variance within each group.
• The categories (groups) must be mutually exclusive and exhaustive.
• Larger samples are needed than for linear regression because maximum likelihood coefficients are large-sample estimates. A minimum of 50 cases per predictor is recommended.
Research Question (1)
Step 4: Choose statistical test
• Simple logistic regression with (r - 1)(c - 1) df
Step 5: Find the p-value
• Calculate the Wald Chi-square.
• Formula: Wald Chi-square = (coefficient estimate / SE)²

Hands-on: SLogR
[SPSS dialog screenshots, steps 1 to 11]

Output: SLogR (Beginning Block)
[SPSS output tables]

Output: SLogR (Method = Enter)
Omnibus Test of Model Coefficients:
• Here SPSS has added the “radiation time” variable as a predictor.
• The Model Coefficient gives a Chi-square of 12.858 on 1 degree of freedom (df), significant beyond 0.001.
• This tests the Ho that adding the “radiation time” variable to the model does not improve its ability to predict lung cancer.
• Therefore, the inference is that adding the “radiation time” variable improves the model.

Output: SLogR (Method = Enter)
-2 Log Likelihood (LL):
• Under the model summary, we can see that the -2 Log LL statistic is 129.490.
• This statistic measures how poorly the model predicts lung cancer: the smaller the statistic, the better the model.
• Although SPSS does not give us this statistic for the model that had only the intercept, you can derive it to be: 129.490 + 12.858 = 142.348.
• Adding the “radiation time” variable reduced the -2 LL statistic by 142.348 - 129.490 = 12.858, which is the χ² statistic seen in the omnibus test.

Output: SLogR (Method = Enter)
Cox & Snell R2:
• The Cox & Snell R2 can be interpreted like R2 in linear regression, but it cannot reach a maximum value of 1.
• In this study we can conclude that about 11.4% of the variation in lung cancer status is explained by radiation time.
Nagelkerke R2:
• It is a modification of the Cox & Snell R2.
• The Nagelkerke R2 can reach a maximum of 1.

Output: SLogR (Method = Enter)
Hosmer & Lemeshow test:
• Pearson Chi-square: tests the Ho that the numbers of observed and predicted outcomes are not different.
• If the chi-square statistic is smaller than the critical value for its degrees of freedom, giving a p-value > 0.05, the Ho cannot be rejected and the model fits well.
• Otherwise, the model does not fit well.
• In this analysis, the p-value was more than 0.05.
• The test was therefore not statistically significant (p > 0.05).
• Therefore, we can conclude that the model fits well.

Output: SLogR (Method = Enter)
• This is the contingency table of the sample. [SPSS output screenshot]

Output: SLogR (Method = Enter)
Classification Table:
• It reports the sensitivity and specificity of the model's predictions of the dependent variable.
• It is not very important for simple logistic regression, but very important in multiple logistic regression.

Output: SLogR (Method = Enter)
Step 6: Conclusion
a) Statistical conclusion.
• Since the p-value (p=0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is an association between radiation time and lung cancer.

Output: SLogR (Method = Enter)
Step 6: Conclusion
b) Problem conclusion.
• The Variables in the Equation output shows that the regression equation is:
  ln[P / (1 - P)] = -7.837 + 0.053 × radiation time, i.e. P / (1 - P) = exp(-7.837 + 0.053 × radiation time)
• For each 1-unit increase in radiation time (in seconds), the log odds of lung cancer increase by 0.053.
• OR = exp(0.053) = 1.055

Output: SLogR (Method = Enter)
Example:
• For a 5-unit increase in radiation time (in seconds), the odds of lung cancer are multiplied by OR = exp(0.053 × 5) = 1.30.
• For a 10-unit increase in radiation time (in seconds), the odds of lung cancer are multiplied by OR = exp(0.053 × 10) = 1.69.
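
The odds-ratio arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the slope 0.053 is the value quoted on the slides, while the function name `odds_ratio` is introduced here for illustration and is not part of SPSS.

```python
import math

# Slope (B) for radiation time, as quoted on the slides.
B_RADIATION = 0.053


def odds_ratio(b, units=1.0):
    """Odds ratio for a `units`-sized increase in a predictor with slope b."""
    return math.exp(b * units)


# Odds ratios for 1-, 5- and 10-second increases in radiation time.
for k in (1, 5, 10):
    print(f"{k:>2} unit(s): OR = {odds_ratio(B_RADIATION, k):.3f}")
```

Because the slope is on the log-odds scale, the effect of a k-unit increase is exp(k × B), not k × exp(B). Small differences in the last decimal place from the slide values come from rounding.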

Summary: SLogR
Table: The association between radiation time and lung cancer (N=106)
Variable          B (SE)          Wald^a (df)    OR (95% CI)          p-value
Radiation time    0.053 (0.016)   11.013 (1)     1.06 (1.02, 1.09)    0.001*
* statistically significant at α=0.05
a Statistical test: Simple Logistic Regression

Research Question (2)
Research question:
• Is smoking associated with lung cancer?
• Independent variable: smoking status (binary: Yes / No).
• Dependent variable: lung cancer (binary: Yes / No).
Step 1: Specify Ho and Ha
a) Null hypothesis:
• Ho: There is no association between smoking and lung cancer.
b) Alternate hypothesis:
• Ha: There is an association between smoking and lung cancer.

Hands-on: SLogR [SPSS dialog screenshots, steps 1-11]

Output: SLogR (Beginning Block) [SPSS output screenshot]

Output: SLogR (Method = Enter)
Omnibus Tests of Model Coefficients:
• Here SPSS has added the “smoking” variable as a predictor.
• The Model row gives a chi-square of 22.859 on 1 degree of freedom (df), significant beyond 0.001.
• This is a test of the Ho that adding the “smoking” variable to the model has not significantly increased our ability to predict lung cancer.
• Therefore, the inference is that adding the “smoking” variable improves the model.

Output: SLogR (Method = Enter)
-2 Log Likelihood (-2LL):
• Under Model Summary, the -2LL statistic is 119.489.
• This statistic measures how poorly the model predicts lung cancer: the smaller the statistic, the better the model.
• SPSS does not report this statistic for the model that has only the intercept, but it can be derived: 119.489 + 22.859 = 142.348.
• Adding the “smoking” variable reduced the -2LL statistic by 142.348 - 119.489 = 22.859, which is the χ² statistic seen in the omnibus test.
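
The -2LL bookkeeping above is simple enough to verify by hand, but a short sketch makes the relationship explicit. The numbers are the ones quoted on the slides; the helper names `null_deviance` and `chi2_sf_1df` are hypothetical, and the p-value formula relies on the standard identity that a chi-square variable with 1 df is a squared standard normal.

```python
import math


def null_deviance(model_neg2ll, omnibus_chi2):
    """-2LL of the intercept-only model = model -2LL + omnibus chi-square."""
    return model_neg2ll + omnibus_chi2


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# Radiation-time model and smoking model, values as quoted on the slides.
print(round(null_deviance(129.490, 12.858), 3))
print(round(null_deviance(119.489, 22.859), 3))

# p-values of the two omnibus chi-square statistics (1 df each).
print(chi2_sf_1df(12.858))
print(chi2_sf_1df(22.859))
```

Both models recover the same intercept-only -2LL, which is expected because they share the same outcome variable and sample.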

Output: SLogR (Method = Enter)
Cox & Snell R²:
• The Cox & Snell R² can be interpreted like R² in linear regression, but it cannot reach a maximum value of 1.
• In this study, we can conclude that 19.4% of the variation in lung cancer is explained by smoking.
Nagelkerke R²:
• It is a modification of the Cox & Snell R².
• The Nagelkerke R² can reach a maximum of 1.

Output: SLogR (Method = Enter)
Hosmer & Lemeshow test:
• Pearson chi-square: tests the Ho that the observed and predicted numbers of outcomes are not different.
• If the chi-square statistic is smaller than the critical value for its degrees of freedom, giving a p-value > 0.05, the Ho cannot be rejected and the model fits well.
• Otherwise, the model does not fit well.
• In this analysis, the p-value was more than 0.05.
• The test was therefore not statistically significant (p > 0.05).
• Therefore, we can conclude that the model fits well.

Output: SLogR (Method = Enter)
• This is the contingency table of the sample. [SPSS output screenshot]

Output: SLogR (Method = Enter)
Classification Table:
• It reports the sensitivity and specificity of the model's predictions of the dependent variable.
• It is not very important for simple logistic regression, but very important in multiple logistic regression.

Output: SLogR (Method = Enter)
We can now use this model to predict the odds of having lung cancer:
• The odds prediction equation is: Odds = exp(a + bx)
If the subject is a non-smoker (No = 0), then:
• Odds = exp[-1.39 + 2.02(0)] = exp(-1.39) = 0.25
• For a non-smoker, having lung cancer is 0.25 times as likely as not having it.
If the subject is a smoker (Yes = 1), then:
• Odds = exp[-1.39 + 2.02(1)] = exp(0.63) = 1.88
• For a smoker, having lung cancer is 1.88 times as likely as not having it.

Output: SLogR (Method = Enter)
What is odds ratio (OR)?
• OR is the ratio between two odds:
  OR = (odds of lung cancer in smokers) / (odds of lung cancer in non-smokers)
• Therefore, OR = 1.88 / 0.25 = 7.50, which is almost equal to Exp(B) in the “Variables in the Equation” table.
• OR is thus equivalent to the exponential of the log odds of lung cancer: exp(2.02) = 7.50.

Output: SLogR (Method = Enter)
Interpreting Odds Ratio (OR):
• OR = exp(2.02) = 7.50
• The model predicts that the odds of having lung cancer are 7.50 times higher for smokers than for non-smokers.
• If lung cancer is a rare condition, the OR approximates the Relative Risk (RR).

Summary: SLogR
Table: The association between smoking and lung cancer (N=106)
Variable               B (SE)         Wald^a (df)   OR (95% CI)           p-value
Smoking (Yes vs No)    2.02 (0.45)    20.30 (1)     7.50 (3.12, 18.02)    <0.001*
* statistically significant at α=0.05
a Statistical test: Simple Logistic Regression

OR

Table: The association between smoking and lung cancer (N=106)
Smoking    B (SE)         Wald^a (df)   OR (95% CI)           p-value
Yes        2.02 (0.45)    20.30 (1)     7.50 (3.12, 18.02)    <0.001*
No                                      1
* statistically significant at α=0.05
a Statistical test: Simple Logistic Regression

Research Question (3)
Research question:
• Is diabetes status associated with myocardial infarction?
• Independent variable: diabetes status (three categories: Normal, Impaired and Diabetes).
• Dependent variable: myocardial infarction (binary: Yes / No).
Step 1: Specify Ho and Ha
a) Null hypothesis:
• Ho: There is no association between diabetes status and myocardial infarction.
b) Alternate hypothesis:
• Ha: There is an association between diabetes status and myocardial infarction.

Hands-on: SLogR [SPSS dialog screenshots, steps 1-18]

Output: SLogR (Preliminary)
• Here, we set “Normal” as the reference group.

Output: SLogR (Beginning Block)
• This table shows that DM(2) (Diabetes) is statistically significant compared with DM (reference group: Normal) (p=0.005).

Output: SLogR (Method = Enter)
Omnibus Tests of Model Coefficients:
• Here, SPSS has added the “DM status” variable as a predictor.
• The Model row gives a chi-square of 9.677 on 2 degrees of freedom (df), significant at p = 0.008.
• This is a test of the Ho that adding the “DM status” variable to the model has not significantly increased our ability to predict myocardial infarction.
• Therefore, the inference is that adding the “DM status” variable improves the model.
• Since “DM status” has 3 categories and “Normal” is the reference group, AT LEAST ONE comparison will be significant: either “Impaired DM” versus “Normal”, or “DM” versus “Normal”.

Output: SLogR (Method = Enter)
-2 Log Likelihood (-2LL):
• Under Model Summary, the -2LL statistic is 135.416.
• This statistic measures how poorly the model predicts myocardial infarction: the smaller the statistic, the better the model.
• SPSS does not report this statistic for the model that has only the intercept, but it can be derived: 135.416 + 9.677 = 145.093.
• Adding the “DM status” variable reduced the -2LL statistic by 145.093 - 135.416 = 9.677, which is the χ² statistic seen in the omnibus test.

Output: SLogR (Method = Enter)
Cox & Snell R²:
• The Cox & Snell R² can be interpreted like R² in linear regression, but it cannot reach a maximum value of 1.
• In this study, we can conclude that 8.7% of the variation in myocardial infarction is explained by DM status.
Nagelkerke R²:
• It is a modification of the Cox & Snell R².
• The Nagelkerke R² can reach a maximum of 1.

Output: SLogR (Method = Enter)
Hosmer & Lemeshow test:
• Pearson chi-square: tests the Ho that the observed and predicted numbers of outcomes are not different.
• If the chi-square statistic is smaller than the critical value for its degrees of freedom, giving a p-value > 0.05, the Ho cannot be rejected and the model fits well.
• Otherwise, the model does not fit well.
• In this analysis, the p-value was more than 0.05.
• The test was therefore not statistically significant (p > 0.05).
• Therefore, we can conclude that the model fits well.

Output: SLogR (Method = Enter)
• This is the contingency table of the sample. [SPSS output screenshot]

Output: SLogR (Method = Enter)
Classification Table:
• It reports the sensitivity and specificity of the model's predictions of the dependent variable.
• In this scenario:
  i. Sensitivity: 39.1%
  ii. Specificity: 85.0%

Output: SLogR (Method = Enter)
The Variables in the Equation output shows the regression equation:
• Comparing “Normal” and “Impaired DM”:
• Log odds for Normal = b0 + b × Normal = -0.981 + 0.629(0)
• Log odds for Impaired DM = b0 + b × Impaired DM = -0.981 + 0.629(1)

Output: SLogR (Method = Enter)
• Log odds, Impaired DM versus Normal:
  = Log odds Impaired DM - Log odds Normal
  = (-0.981 + 0.629) - (-0.981 + 0)
  = 0.629
• OR = exp(0.629) = 1.877
• People with Impaired DM have about twice the odds of myocardial infarction compared with Normal.
• However, the association was NOT statistically significant (p=0.201).

Output: SLogR (Method = Enter)
The Variables in the Equation output shows the regression equation:
• Comparing “Normal” and “Diabetes”:
• Log odds for Normal = b0 + b × Normal = -0.981 + 1.674(0)
• Log odds for Diabetes = b0 + b × Diabetes = -0.981 + 1.674(1)

Output: SLogR (Method = Enter)
• Log odds, Diabetes versus Normal:
  = Log odds Diabetes - Log odds Normal
  = (-0.981 + 1.674) - (-0.981 + 0)
  = 1.674
• OR = exp(1.674) = 5.33
• People with Diabetes have about 5 times the odds of myocardial infarction compared with Normal.
• The association was statistically significant (p=0.003).

Summary: SLogR
Table: The association between diabetes status and myocardial infarction (N=106)
Diabetes status    B (SE)         Wald^a (df)   OR (95% CI)           p-value
Normal                            8.945 (2)     1                     ref.
Impaired DM        0.63 (0.49)    1.634 (1)     1.88 (0.72, 4.93)     0.201
Diabetes           1.67 (0.57)    8.772 (1)     5.33 (1.76, 16.15)    0.003*
* statistically significant at α=0.05
a Statistical test: Simple Logistic Regression

Thank you for your attention
rodi@salam.uitm.edu.my

Basic Biostatistics Workshop Using SPSS 2019
Lesson 9: Non-Parametric Method Data Analyses
Dr. Mohamad Rodi Isa, Public Health Medicine, UiTM

Non-Parametric tests
• They are used when the characteristics of the population from which a sample is drawn are unknown.
• No assumption is made about the distribution of the variable → distribution free.
• They can deal with small sample sizes.
• They are less powerful than parametric tests: because the data are replaced by ranks, they are less likely to find a true difference when one exists.
• They are more robust when testing skewed data.

Non-Parametric tests
Advantages:
• They allow testing of hypotheses that are not statements about population parameter values.
• The tests may be used when the form of the sampled population is unknown.
• The procedures tend to be computationally easier and consequently more quickly applied than parametric procedures.
• The procedures may be applied when the data being analyzed consist merely of rankings or classifications.
Disadvantages:
• Using these procedures with data that can be handled by a parametric procedure results in a waste of data.
• The application of some of the non-parametric tests may be laborious for large samples.

Parametric vs Non-parametric
                          Parametric                   Non-parametric
Assumed distribution      Normal                       Any
Assumed variance          Homogeneous                  Any
Typical data              Ratio or interval            Ordinal or nominal
Data set relationships    Independent                  Any
Usual central measure     Mean (SD)                    Median (IQR)
Benefits                  Can draw more conclusions    Simplicity, less affected by outliers

9.0 Non-Parametric Tests
Type                      Description                              Statistical test
1. Single distribution    -                                        • Kolmogorov-Smirnov
2. Single sample          Compare with hypothesized median         • One-sample Wilcoxon Signed-Rank
3. Independent            Comparing 2 medians                      • Mann-Whitney U
                          Comparing ≥ 3 medians                    • Kruskal-Wallis
                          Continuous → categories                  • Median
4. Paired                 Comparing 2 related median groups        • Sign test
                                                                   • Wilcoxon Match-Pair (Wilcoxon Pair-Rank)
                          Comparing ≥ 3 related median groups      • Friedman
5. Correlation            Continuous versus continuous             • Spearman rank
                          Ordinal versus ordinal                   • Kendall's tau
                          Nominal versus nominal                   • Phi coefficient

Hypothesis testing for Non-Parametric
• Step 1: Specify the null and alternate hypotheses.
• Step 2: Choose the significance level α, one- or two-sided (tailed).
• Step 3: Check assumptions.
• Step 4: Choose the test statistic.
• Step 5: Distribution of the test statistic.
• Step 6: Decision rule.
• Step 7: Calculate the test statistic to find the p-value:
  - p > α → do not reject Ho → no significant difference.
  - p < α → reject Ho → significant difference.
• Step 8: Statistical and problem conclusions.

9.1 Non-Parametric Tests
Type                      Description    Statistical test
1. Single distribution    -              • Kolmogorov-Smirnov

9.1 One-Sample KS
• It is a non-parametric test of the equality of continuous, one-dimensional probability distributions.
• It can be used to compare:
  i. a sample with a reference probability distribution (one-sample KS test); or
  ii. two samples (two-sample KS test): this falls under independent samples and determines whether the distributions of two groups are equal (similar) or not.

One-Sample Kolmogorov-Smirnov
Research question:
• Is troponin normally distributed?
• Variable: troponin, a continuous variable.
a) Null hypothesis:
• Ho: The data for troponin follow a specified distribution (normally distributed).
b) Alternate hypothesis:
• Ha: The data for troponin DO NOT follow the specified distribution (not normally distributed).

Hands-on: One-Sample KS (Method 1) [SPSS dialog screenshots, steps 1-7]

Output & Conclusion
• Since the p-value (<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that troponin was not normally distributed.
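
To make the idea behind the one-sample KS test concrete, the sketch below computes the KS statistic D, the largest vertical gap between the empirical distribution of a sample and a reference normal curve. It is an illustration under stated assumptions, not a reproduction of the SPSS procedure: the troponin values are hypothetical (the workshop data are not shown on the slides), the normal curve uses the sample mean and SD, and no p-value is computed.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal(mu, sigma) variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical, right-skewed values standing in for troponin (ng/mL).
troponin = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 1.9, 4.5]
mu = statistics.mean(troponin)
sigma = statistics.stdev(troponin)

d = ks_statistic(troponin, lambda x: normal_cdf(x, mu, sigma))
print(f"D = {d:.3f}")
```

A larger D means a poorer match to the normal curve. Turning D into a p-value needs the KS reference distribution (and a correction when the mean and SD are estimated from the same sample), which is the part SPSS handles.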

Hands-on: One-Sample Kolmogorov-Smirnov (Method 2) [SPSS dialog screenshots, steps 1-15]

Output & Conclusion
p<0.001
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that troponin was not normally distributed.

9.2 Non-Parametric Tests
Type                Description                         Statistical test
2. Single sample    Compare with hypothesized median    • One-sample Wilcoxon Signed-Rank

9.2 One-Sample Wilcoxon Signed-Rank
• The one-sample Wilcoxon signed-rank test is a non-parametric alternative to the one-sample t-test when the data cannot be assumed to be normally distributed.
• It is based on ranks, and because of that the location parameter is the median.
• It is used to determine whether the median of the sample is equal to a known standard value (i.e. a theoretical value).

One-Sample Wilcoxon Signed-Rank
Research question:
• You want to determine whether the level of troponin is 1 ng/mL.
• Test variable: troponin, a continuous variable.
• Hypothesized median: 1 ng/mL
a) Null hypothesis:
• Ho: The median troponin level is equal to 1 ng/mL.
b) Alternate hypothesis:
• Ha: The median troponin level is NOT equal to 1 ng/mL.

Hands-on: One-Sample Wilcoxon Signed-Rank [SPSS dialog screenshots, steps 1-13]

Output & Conclusion
p=0.527
• Since the p-value (0.527) is more than α (0.05), we DO NOT reject Ho.
• Therefore, we can conclude that the median of troponin is not significantly different from (NOT FAR from) 1 ng/mL.

9.3 Non-Parametric Tests
Type              Description                Statistical test
3. Independent    Comparing 2 medians        • Mann-Whitney U
                  Comparing ≥ 3 medians      • Kruskal-Wallis
                  Continuous → categories    • Median

9.3.1 Mann-Whitney U
• It is a non-parametric alternative to the independent t-test.
• It is also known as the Wilcoxon rank-sum test.
• It can be used to compare two population medians if the underlying data are not normally distributed or if the measurements are ordinal.
• The rank-sum test (Mann-Whitney U test) is performed on ranks rather than the actual measurements.

Mann-Whitney U
Research question:
• You want to determine whether the median cortisol level is different between males and females.
a) Null hypothesis:
• Ho: There is no difference in the median cortisol level between males and females (Ho: θ1 = θ2).
b) Alternate hypothesis:
• Ha: There is a difference in the median cortisol level between males and females (Ha: θ1 ≠ θ2).

Hands-on: Mann-Whitney U (Method 1) [SPSS dialog screenshots, steps 1-12]

Output & Conclusion
• Since the p-value (p=0.037) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there was a significant difference in the median of cortisol between males and females.
• The median cortisol level in males is significantly higher than in females (p=0.037) [Median = 9.80 (IQR: 1.2) versus Median = 8.50 (IQR: 3.1)].

Output & Conclusion
Table: The comparison of cortisol level between male and female (N=15)
Variable    Gender    n    Median (IQR)    z^a       p-value
Cortisol    Male      7    9.80 (1.2)      -2.087    0.037*
            Female    8    8.50 (3.1)
* statistically significant at α=0.05
a Statistical test: Mann-Whitney U

Hands-on: Mann-Whitney U (Method 2) [SPSS dialog screenshots, steps 1-13]

Output & Conclusion
p=0.040
• Since the p-value (p=0.040) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there was a significant difference in the median of cortisol between males and females.

Summary: Mann-Whitney U
Table: The comparison of cortisol level between male and female (N=15)
Variable    Gender    n    Median (IQR)    Test statistic^a    p-value
Cortisol    Male      7    9.80 (1.2)      10.00               0.040*
            Female    8    8.50 (3.1)
* statistically significant at α=0.05
a Statistical test: Mann-Whitney U

9.3.2 Kruskal-Wallis test
• It is a non-parametric alternative to ANOVA.
• It can be used to compare three or more groups if the underlying data are not normally distributed or if the measurements are ordinal.
• The test is performed on ranks rather than the actual measurements.

Kruskal-Wallis test
Research question:
• Is there any difference in the median packed cell volume (PCV) between 3 races (Malay, Chinese and Indian)?
a) Null hypothesis:
• Ho: The median packed cell volume (PCV) is the same in the 3 races (Malay, Chinese and Indian).
• Ho: θ1 = θ2 = θ3.
b) Alternative hypothesis:
• Ha: At least one pair of races (Malay, Chinese and Indian) differs in median packed cell volume (PCV).

Hands-on: Kruskal-Wallis test (Method 1) [SPSS dialog screenshots, steps 1-12]

Output: Kruskal-Wallis test
• Since the p-value (p=0.005) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we conclude that at least one pair of races (Malay, Chinese and Indian) differs in median packed cell volume (PCV).
Which pair?
• Since there is no post-hoc or pairwise comparison in this method, we need to analyse using the Mann-Whitney U test, pair by pair, with an adjustment.
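
The phrase "performed on ranks" can be illustrated with a small sketch of the Kruskal-Wallis H statistic. The PCV values below are hypothetical (only the group sizes 5, 4 and 4 follow the slides), the function names are introduced here, the formula shown is the textbook version without a tie correction, and the p-value uses the large-sample chi-square approximation with 2 df, which is rough for samples this small.

```python
import math


def kruskal_wallis_h(groups):
    """H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), for data without ties."""
    pooled = sorted(v for g in groups for v in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    # Rank every observation in the pooled sample (1 = smallest).
    rank = {v: r for r, v in enumerate(pooled, start=1)}
    n_total = len(pooled)
    # R_j is the sum of ranks in group j, n_j its size.
    s = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * s - 3.0 * (n_total + 1)


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 df."""
    return math.exp(-x / 2.0)


# Hypothetical PCV values for three groups of sizes 5, 4 and 4.
malay = [18, 25, 31, 37, 44]
chinese = [7, 8, 9, 10]
indian = [2, 3, 4, 5]

h = kruskal_wallis_h([malay, chinese, indian])
p = chi2_sf_2df(h)
print(f"H = {h:.3f}, approximate p = {p:.4f}")
```

Only the ordering of the values matters: replacing 44 by 440 leaves H unchanged, which is why the test is less affected by outliers than ANOVA.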

Hands-on: Mann-Whitney U test [Malay (1) versus Chinese (2)] [SPSS dialog screenshots, steps 1-12]
Then:
- change the groups to (1, Malay) and (3, Indian); and
- change the groups to (2, Chinese) and (3, Indian).

Output: Kruskal-Wallis test
[SPSS output screenshots: Malay versus Chinese, Malay versus Indian, Chinese versus Indian]
• All pairs were statistically significant (p<0.05) at α=0.05.
• However, we need to adjust by multiplying each p-value by the number of pairwise comparisons (here 3) to get the adjusted p-value (this is called the Bonferroni correction).

Output: Kruskal-Wallis test
• Therefore:
  i. between Malay and Chinese: 0.014 × 3 = 0.042
  ii. between Malay and Indian: 0.014 × 3 = 0.042
  iii. between Chinese and Indian: 0.020 × 3 = 0.060
• The significant pairs were:
  i. Malay and Chinese (p=0.042); and
  ii. Malay and Indian (p=0.042)

Output: Summary
Table: Median packed cell volume by race (N=13)
Variable              Race       n    Median (IQR)    Test statistic^a (df)    p-value
Packed cell volume    Malay      5    31.00 (19.0)    10.711 (2)               0.005*b
                      Chinese    4    8.00 (1.5)
                      Indian     4    3.50 (2.5)
* statistically significant at α=0.05
a Kruskal-Wallis test
b The significant differences are between Malay and Indian (p=0.042) and between Malay and Chinese (p=0.042) by pairwise comparison (Bonferroni adjustment).
• The median packed cell volume for Malay is higher than for Chinese [31.00 (IQR: 19.0) versus 8.00 (IQR: 1.5), p=0.042], and the median packed cell volume for Malay is higher than for Indian [31.00 (IQR: 19.0) versus 3.50 (IQR: 2.5), p=0.042].

Hands-on: Kruskal-Wallis test (Method 2) [SPSS dialog screenshots, steps 1-13]

Output: Kruskal-Wallis test
p=0.005
• Since the p-value (p=0.005) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that at least one pair of races (Malay, Chinese and Indian) differs in median packed cell volume (PCV).
Which pair? Malay - Chinese? Malay - Indian? Chinese - Indian?

Output: Kruskal-Wallis test
Drop-down menu:
• Select pairwise comparison.

Output: Kruskal-Wallis test
• The significant pair difference is between Indian and Malay by pairwise comparison (p=0.003).
• The other pairs were not statistically significant (p>0.05).

Output: Summary
Table: Median packed cell volume by race (N=13)
Variable              Race       n    Median (IQR)    Test statistic^a (df)    p-value
Packed cell volume    Malay      5    31.00 (19.0)    10.711 (2)               0.005*b
                      Chinese    4    8.00 (1.5)
                      Indian     4    3.50 (2.5)
* statistically significant at α=0.05
a Kruskal-Wallis test
b The significant difference is between Malay and Indian (p=0.003) by pairwise comparison.
• The median packed cell volume for Malay is higher than for Indian [31.00 (IQR: 19.0) versus 3.50 (IQR: 2.5), p=0.003].

9.3.3 Median test
• It is a non-parametric procedure used to test the Ho that two independent samples have been drawn from populations with equal medians.
• It is almost the same as the chi-square test for independence; however, we need to find the median of the dependent variable first, before we can divide the dependent variable into two groups.

Median test
Research question:
• A researcher wants to determine whether the median endorphin level differs between those who had heart disease and those with no heart disease.
a) Null hypothesis:
• Ho: There is no difference in the median endorphin level between those who had heart disease and those with no heart disease.
b) Alternate hypothesis:
• Ha: There is a difference in the median endorphin level between those who had heart disease and those with no heart disease.

Hands-on: Explore [SPSS dialog screenshots, steps 1-5]
First, we need to explore the dependent variable (endorphin) to find the median.

Output: Explore
• The median endorphin level is 8.5.
• Therefore, we use 8.5 as the cut-off point.
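
The "find the median, then split" step can be sketched as follows. The endorphin values and group labels are hypothetical, `median_split_table` is a name introduced here, and the choice to count values equal to the median in the lower category is an assumption about the convention, so check it against your own SPSS output.

```python
import statistics


def median_split_table(values, groups):
    """Count values above vs. at-or-below the pooled median, per group."""
    cut = statistics.median(values)
    table = {}
    for v, g in zip(values, groups):
        row = table.setdefault(g, {"above": 0, "at_or_below": 0})
        row["above" if v > cut else "at_or_below"] += 1
    return cut, table


# Hypothetical endorphin levels and heart disease status (not the workshop data).
endorphin = [6.0, 7.5, 8.0, 8.5, 9.0, 9.5, 11.0]
heart_disease = ["yes", "no", "yes", "no", "no", "yes", "yes"]

cut, table = median_split_table(endorphin, heart_disease)
print("cut-off:", cut)
for group, counts in table.items():
    print(group, counts)
```

The resulting 2 × 2 table of counts is what the median test then analyses in the same way as a chi-square test for independence.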

Hands-on: Median Test [SPSS dialog screenshots, steps 1-14]

Output: Median Test
p=1.000
• Since the p-value (p=1.000) is more than α (0.05), we DO NOT reject Ho.
• We can conclude that there is no difference in the median endorphin level between those who had heart disease and those with no heart disease (median is at 8.5 µmol/L).

Output: Summary Cross tabulation
Table: The cross-tabulation between heart disease and endorphin level (N=21)
                   Heart disease
Endorphin level    Yes, n (%)    No, n (%)    Total, n (%)    Test statistic^a (df)    p-value
More than 8.5      5 (45.5)      6 (54.5)     11 (100.0)      0.043 (1)                1.000#
Less than 8.5      6 (60.0)      4 (40.0)     10 (100.0)
a Median test (at 8.50)

9.4 Non-Parametric Tests
Type         Description                            Statistical test
4. Paired    Comparing 2 related median groups      • Sign test
                                                    • Wilcoxon Match-Pair (Wilcoxon Pair-Rank)
             Comparing ≥ 3 related median groups    • Friedman

9.4.1 Paired data - Sign test
• The “paired-sample sign test” is used to determine whether there is a median difference between paired or matched observations.
• It can be considered an alternative to the dependent t-test or the Wilcoxon signed-rank test when the distribution of differences between paired observations is either not normal or asymmetrical.
• The sign test may be employed to test the Ho that the median difference is 0.

Paired data - Sign test
• Assumptions:
  i. The dependent variable should be continuous or ordinal.
  ii. The independent variable should consist of two categorical “related groups” or “matched pairs”.
  iii. The paired observations need to be independent between participants.
  iv. The difference scores (i.e. the differences between the paired observations) are from a continuous distribution.

Sign test - Paired data
Research question:
• Does the dopamine level differ before and after surgery?
a) Null hypothesis:
• Ho: Pre-surgery and post-surgery median dopamine levels are the same.
  - Ho: Median difference = 0.
b) Alternate hypothesis:
• Ha: Pre-surgery and post-surgery median dopamine levels are NOT the same.
  - Ha: Median difference ≠ 0.

Hands-on: Sign test - Paired data [SPSS dialog screenshots, steps 1-12]

Output: Sign test - Paired data
p=0.002
• Since the p-value (p=0.002) is less than α (0.05), we reject Ho and accept Ha.
• We can conclude that the median dopamine levels pre-surgery and post-surgery are significantly NOT the same.

Output: Summary
• The median dopamine level before surgery was significantly higher than after surgery [Median: 9.50 (IQR: 3) versus Median: 8.00 (IQR: 2), p=0.002].
Table: The comparison of dopamine level pre-surgery and post-surgery (N=16)
Variable                   N     Median (IQR)    Test statistic^a    p-value
Dopamine (pre-surgery)     16    9.50 (3.00)     1.000               0.002*
Dopamine (post-surgery)    16    8.00 (3.00)
* statistically significant at α=0.05
a Statistical test: Sign test - Paired data

9.4.2 Wilcoxon signed-rank test
• It is a non-parametric alternative to the paired (dependent-samples) t-test.
• It is also known as the Wilcoxon Match-Pair or Wilcoxon Pair-Rank test.
• Primarily, it is used when the underlying population of differences cannot be assumed to be normally distributed.
• It can also be used when data are ordinal rather than discrete.
• Because it relies on ranks, this non-parametric test is less sensitive to measurement error and to outlying values.
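
The way the signed-rank test "relies on ranks" can be sketched in a few lines: rank the absolute paired differences, then add up the ranks separately for positive and negative differences. The data and the function name `signed_rank_sums` are hypothetical; zero differences are dropped and tied absolute differences share an average rank, which is the usual textbook convention. No p-value is computed here.

```python
def signed_rank_sums(before, after):
    """Return (W_plus, W_minus) for paired data, dropping zero differences."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    # Indices of the differences, ordered by absolute size.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired measurements (integers keep tie detection exact).
before = [12, 9, 11, 14, 8, 10, 13, 9]
after = [10, 10, 8, 9, 9, 9, 7, 9]

w_plus, w_minus = signed_rank_sums(before, after)
print("W+ =", w_plus, "W- =", w_minus)
```

If the two conditions did not differ, W+ and W- would be of similar size; a very small value for one of them is evidence against Ho.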

Wilcoxon signed-rank test
Research question:
• Do calcium levels differ before and after pregnancy?
a) Null hypothesis:
• Ho: Pre-pregnancy and post-pregnancy median calcium levels are the same.
  - Ho: Median difference = 0.
b) Alternate hypothesis:
• Ha: Pre-pregnancy and post-pregnancy median calcium levels are NOT the same.
  - Ha: Median difference ≠ 0.

Hands-on: Wilcoxon signed-rank test (Method 1) [SPSS dialog screenshots, steps 1-12]

Output: Wilcoxon signed-rank test
p=0.176
• Since the p-value (p=0.176) is more than α (0.05), we DO NOT reject Ho.
• Therefore, we can conclude that the pre-pregnancy and post-pregnancy median calcium levels are not significantly different.

Output: Summary
• The median calcium level pre-pregnancy was higher than post-pregnancy [Median: 10.30 (IQR: 3.0) versus Median: 9.80 (IQR: 2.7)]. However, the difference was not statistically significant (p=0.176).
Table: The comparison of calcium level pre-pregnancy and post-pregnancy (N=16)
Variable                    N     Median (IQR)    Test statistic^a    p-value
Calcium (pre-pregnancy)     16    10.30 (3.00)    31.000              0.176
Calcium (post-pregnancy)    16    9.80 (2.70)
a Statistical test: Wilcoxon Match-Pair test

Hands-on: Wilcoxon signed-rank test (Method 2) [SPSS dialog screenshots, steps 1-7]

Output: Wilcoxon signed-rank test (Method 2)
Table: The comparison of calcium level pre-pregnancy and post-pregnancy (N=16)
Variable                    N     Median (IQR)    Test statistic^a    p-value
Calcium (pre-pregnancy)     16    10.30 (3.00)    31.000              0.176
Calcium (post-pregnancy)    16    9.80 (2.70)
a Statistical test: Wilcoxon Match-Pair test

9.4.3 Friedman Test
• The Friedman test is a non-parametric alternative to the parametric one-way repeated-measures ANOVA.
• In its use of ranks, it is similar to the Kruskal-Wallis one-way ANOVA by ranks.
• It is used to detect differences in treatments across multiple test attempts.
• The procedure involves ranking each row (or block) together, then considering the values of the ranks by columns.

Friedman Test
Research question:
• A group of researchers conducts a study to compare three models of stimulators by giving them grades. They want to determine whether there is any difference in grade between the stimulators (Model A, B and C).
a) Null hypothesis:
• Ho: The median grades of the three models are equal.
b) Alternative hypothesis:
• Ha: The median grades of the three models are not equal. At least one pair differs in median grade.

Hands-on: Friedman Test (Method 1) [SPSS dialog screenshots, steps 1-7]

Output: Friedman Test
• Since the p-value (p=0.013) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that at least one pair of models (A, B and C) differs in median grade.
Which pair?
• Since there is no post-hoc or pairwise comparison in this method, we need to analyze using the Wilcoxon signed-rank test, pair by pair, with an adjustment.

Hands-on: Wilcoxon signed-rank test [SPSS dialog screenshots, steps 1-7]

Output: Friedman Test
• Two pairs were statistically significant (Model A versus Model B, and Model B versus Model C).
• However, we need to adjust by multiplying each p-value by the number of pairwise comparisons (here 3) to get the adjusted p-value (this is called the Bonferroni correction).
• Therefore:
  - between Model A and Model B: 0.020 × 3 = 0.060
  - between Model B and Model C: 0.021 × 3 = 0.063
• After Bonferroni adjustment, no pair was found to be significantly different.
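
The Bonferroni adjustment used above is a one-line calculation, sketched here with the two raw p-values quoted on the slide (reading the second as 0.021, since 0.021 × 3 = 0.063). The function name `bonferroni` is introduced for illustration; capping the adjusted value at 1 is the usual convention, because a p-value cannot exceed 1.

```python
def bonferroni(p_values, n_comparisons=None):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values) if n_comparisons is None else n_comparisons
    return [min(1.0, p * m) for p in p_values]


# Raw pairwise p-values quoted on the slide; three pairwise comparisons in total.
raw = {"Model A vs Model B": 0.020, "Model B vs Model C": 0.021}
adjusted = dict(zip(raw, bonferroni(list(raw.values()), n_comparisons=3)))

ALPHA = 0.05
for pair, p_adj in adjusted.items():
    verdict = "significant" if p_adj < ALPHA else "not significant"
    print(f"{pair}: adjusted p = {p_adj:.3f} ({verdict})")
```

An equivalent approach is to leave the p-values alone and compare them with α / 3 ≈ 0.0167 instead; both lead to the same decisions.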
Summary: Friedman Test
Descriptive part
Table: Median grade between stimulator models (N=9)
         N    Model A,        Model B,        Model C,        Chi-square (a)    p-value
              Median (IQR)    Median (IQR)    Median (IQR)    (df)
Grade    9    2.00 (1)        3.00 (1)        2.00 (1)        8.706 (2)         0.013*
a Statistical test: Friedman test

Hands-on: Friedman Test (Method 2)
[SPSS screenshots: steps 1 to 14]

Output: Friedman Test
p = 0.013
• Since the p-value (p=0.013) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that at least one pair among the 3 models (A, B and C) differs in median grade.
Which pair? A-B, A-C or B-C?

Output: Friedman Test
Drop-up menu:
• Select: Pairwise Comparison

Output: Friedman Test
• The significantly different pair is Model A versus Model C by pairwise comparison (p=0.029).

Summary: Friedman Test
Descriptive part
• The median grade of Model A is lower than the median grade of Model B [2.00 (IQR: 1) versus 3.00 (IQR: 1)].
Table: Median grade between stimulator models (N=9)
         n    Model A           Model B           Model C           Test statistic (a)    p-value
              [Median (IQR)]    [Median (IQR)]    [Median (IQR)]    (df)
Grade    9    2.00 (1)          3.00 (1)          2.00 (1)          8.706 (2)             0.013* (b)
* Statistically significant at α=0.05
a Friedman test
b The significant difference is between Stimulator Model A and Model C (p=0.029) by pairwise comparison

9.5 Non-Parametric Tests
Descriptions                        Statistical test
5. Correlation
   Continuous versus Continuous     • Spearman rank
   Ordinal versus Ordinal           • Kendall's tau
   Nominal versus Nominal           • Phi-coefficient

9.5.1 Spearman rank Correlation
• It is a bivariate measure of correlation / association that is employed with rank-order data.
• It represents the degree of relationship between two variables.
• The data for both variables are in a rank-order format.
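To make "rank-order format" concrete, the sketch below converts two sets of scores to ranks (tied values share the mean of their ranks) and then correlates the ranks, which is one standard way to obtain Spearman's rs. It is a minimal pure-Python illustration, not the SPSS routine; the function names and the depression and anxiety scores are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rs: the Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for six respondents (not the workshop dataset).
depression = [4, 7, 7, 10, 15, 21]
anxiety = [3, 8, 6, 12, 11, 30]
rs = spearman_rs(depression, anxiety)
print(f"ranks of depression: {average_ranks(depression)}")
print(f"rs = {rs:.3f}")
```

Because only the ranks enter the calculation, any monotonic transformation of the scores leaves rs unchanged.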
Spearman rank Correlation
Research question:
• You want to determine whether there is any correlation between depression score and anxiety score.
a) Null hypothesis:
• Ho: There is no significant (linear) rank correlation between depression score and anxiety score (Ho: rs = 0).
b) Alternative hypothesis:
• Ha: There is a significant (linear) rank correlation between depression score and anxiety score (Ha: rs ≠ 0).

Hands-on: Checking for normality (depression and anxiety)
[SPSS screenshots: steps 1 to 7]

Output: Normality test
• The p-values for both variables (depression and anxiety) were statistically significant (p<0.05).
• Therefore, we can conclude that neither variable (depression nor anxiety) was normally distributed.

Hands-on: Scatter plot
[SPSS screenshots: steps 1 to 8]

Output: Scatter plot
• From the scatter plot, we have a rough idea that there could be a positive rank correlation between depression and anxiety scores.
• Is it significant?

Hands-on: Spearman rank Correlation
[SPSS screenshots: steps 1 to 6]

Output: Spearman rank Correlation
• rs = 0.888
• Since the p-value (<0.001) is less than α (0.05), we reject Ho and accept Ha.
• The correlation is positive and the strength is strong.
• Statistical conclusion: There is a significant, strong positive rank correlation between depression and anxiety scores (p<0.001).
• Problem conclusion: There is a strong positive rank correlation between depression and anxiety scores (rs=0.888).
• We can conclude that when the depression score increases, the anxiety score also tends to increase.

9.5.2 Kendall rank correlation coefficient
• It is a statistic used to measure the strength and direction of association between two variables measured on at least an ordinal scale.
• A “tau” test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
• Three types:
i.
Tau-a - tests the strength of association in a cross tabulation where both variables are ordinal. Tau-a makes no adjustment for ties.
ii. Tau-b - unlike Tau-a, it makes an adjustment for ties.
iii. Tau-c - is more suitable than Tau-b for data in non-square (i.e. rectangular) contingency tables.

Kendall rank correlation coefficient
Example:
• The correlation between examination grade and time spent revising (i.e. where there were six examination grades - A, B, C, D, E and F) and revision time (less than 5 hours; 5 - 9 hours; 10 - 14 hours; 15 - 19 hours; and 20 hours or more).
• The correlation between customer satisfaction (dissatisfied, satisfied and very satisfied) and delivery time (measured in days).
Assumptions:
i. The two variables should be measured on an ordinal or continuous scale.
ii. Kendall's tau-b determines whether there is a monotonic relationship between two variables.

Kendall rank correlation coefficient
Research question:
• Is there any correlation between educational status and salary?
• x1: educational status - No formal; Primary; Secondary; and Tertiary
• x2: salary - less than RM999; RM1,000 - RM4,999; RM5,000 - RM9,999; and more than RM10,000.
Null hypothesis:
• Ho: There is no correlation between educational status and salary.
Alternative hypothesis:
• Ha: There is a correlation between educational status and salary.

Hands-on: Kendall tau-b
[SPSS screenshots: steps 1 to 9]

Output: Kendall tau-b
• Since the p-value (p=0.006) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a significant correlation between educational status and salary.
• There is a weak positive correlation between educational status and salary (tau-b: 0.219).
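As a cross-check on the SPSS output, tau-b can be computed directly from the cross-tabulated counts. The sketch below is a pure-Python illustration (the function name `kendall_tau_b` is hypothetical). It counts concordant and discordant pairs and applies the tie adjustment that distinguishes tau-b from tau-a. The 4 x 4 grid holds the 16 counts from the summary table for this example in the order they are printed; tau-b is symmetric, so it does not matter which variable is taken as the rows.

```python
from math import sqrt

def kendall_tau_b(table):
    """Kendall's tau-b for an ordered contingency table (list of rows)."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # cells in a higher row category
                for l in range(n_cols):
                    if l > j:                   # also higher column: concordant
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                 # lower column: discordant
                        discordant += table[i][j] * table[k][l]
    n = sum(map(sum, table))
    total_pairs = n * (n - 1) // 2
    # Pairs tied on the row variable and on the column variable.
    row_ties = sum(r * (r - 1) // 2 for r in map(sum, table))
    col_ties = sum(c * (c - 1) // 2 for c in map(sum, zip(*table)))
    return (concordant - discordant) / sqrt(
        (total_pairs - row_ties) * (total_pairs - col_ties)
    )

# Counts from the educational level by salary table (N = 106).
counts = [
    [13, 3, 2, 1],
    [16, 7, 4, 3],
    [10, 7, 13, 4],
    [8, 7, 4, 4],
]
tau_b = kendall_tau_b(counts)
print(f"tau-b = {tau_b:.3f}")  # prints tau-b = 0.219, matching the reported value
```

Without the two tie terms in the denominator the result would be tau-a, which is smaller in magnitude here because many respondents share the same category.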
Output: Kendall tau-b
Table: The correlation between educational level and salary
                            Educational level
Salary                  No formal    Primary    Secondary    Tertiary    Value (a)    p-value
- Less than RM999       13           16         10           8           0.219        0.006*
- RM1,000 - RM4,999     3            7          7            7
- RM5,000 - RM9,999     2            4          13           4
- More than RM10,000    1            3          4            4
* Statistically significant at α=0.05
a Statistical test: Kendall's tau-b

9.5.3 Phi-coefficient
• It is a non-parametric test of relationship that operates on two dichotomous variables.
• It cross-classifies the two variables in a 2 x 2 matrix to estimate whether there is a non-random pattern across the four cells.
• The signs [positive (+ve) or negative (-ve)] are relevant:
i. A positive phi-coefficient indicates that most of the data are in the diagonal cells.
ii. A negative phi-coefficient indicates that most of the data are in the off-diagonal cells.
• The main thing to consider is the strength of the relationship between the two variables, then look at the 2 x 2 matrix to determine what it means.

Phi-coefficient
Research question:
• Is there any correlation between smoking and heart disease?
a) Null hypothesis:
• Ho: There is no correlation between smoking and heart disease.
b) Alternative hypothesis:
• Ha: There is a correlation between smoking and heart disease.

Hands-on: Phi-coefficient
[SPSS screenshots: steps 1 to 9]

Output: Phi-coefficient
• Since the p-value (p<0.001) is less than α (0.05), we reject Ho and accept Ha.
• Therefore, we can conclude that there is a significant correlation between smoking and heart disease.
• There is a weak positive correlation between smoking and heart disease.
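The role of the diagonal and off-diagonal cells can be seen from the usual formula for a 2 x 2 table with cells a, b (first row) and c, d (second row): phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The sketch below is a minimal illustration with a hypothetical function name. It uses the counts printed in the summary table for this example (rows: smoking yes/no, columns: heart disease yes/no), then swaps the two columns to show the sign flip.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]; a and d are the diagonal cells."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Counts as printed in the summary table (N = 106).
phi = phi_coefficient(30, 16, 12, 48)

# Swapping the two columns moves the bulk of the data to the off-diagonal cells,
# which flips the sign but leaves the magnitude unchanged.
phi_swapped = phi_coefficient(16, 30, 48, 12)

print(f"phi = {phi:.3f}, after swapping columns = {phi_swapped:.3f}")
```

With these counts the formula gives about 0.458, not the 0.548 shown in the summary table, so either the counts or the coefficient may have been transcribed differently on the slide. The numbers here are best treated as illustrative.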
Summary: Phi-coefficient
Table: The correlation between smoking and heart disease (N=106)
                    Heart disease
Smoking status      Yes    No     Phi-coefficient (a)    p-value
Yes                 30     16     0.548                  <0.001*
No                  12     48
* Statistically significant at α=0.05
a Statistical test: Phi-coefficient

Thank you for your attention
rodi@salam.uitm.edu.my