Essence of Biostatistics

Why do we need to learn biostatistics?
• Essential for the scientific method of investigation
– Formulate a hypothesis
– Design a study to objectively test the hypothesis
– Collect reliable and unbiased data
– Process and evaluate the data rigorously
– Interpret and draw appropriate conclusions
• Essential for understanding, appraisal and critique of the scientific literature

Types of statistical variables
• Descriptive (categorical) variables
– Nominal variables (no order between values): gender, eye color, race group, ...
– Ordinal variables (inherent order among values): response to treatment: none, slow, moderate, fast
• Measurement variables
– Continuous measurement variables: height, weight, blood pressure, ...
– Discrete measurement variables (values are integers): number of siblings, number of times a person has been admitted to a hospital, ...

Statistical variables
• It is important to be able to distinguish the different types of statistical variables and the data they generate, because the choice of statistical indices, charts, and statistical tests depends on these basics

Types of statistical methods
• Descriptive statistical methods
– Provide summary indices for a given set of data, e.g. arithmetic mean, median, standard deviation, coefficient of variation, etc.
• Inductive (inferential) statistical methods
– Produce statistical inferences about a population based on information from a sample derived from the population; variation must be taken into account

[Diagram: estimating population values from sample values — a sample is drawn from the population, and sample statistics are used to estimate population values]
Copyright 2013 © Limsoon Wong

Summarizing data
• Statistics is "making sense of data"
• Raw data have to be processed and summarized before one can make sense of them
• A summary can take the form of
– A summary index: a single value that summarizes the data for a study variable
– Tables
– Diagrams

Summarizing categorical data
• A proportion is a fraction in which the numerator is a subset of the denominator
– Proportion dead = 35/86 = 0.41
• Odds are fractions in which the numerator is not part of the denominator
– Odds in favor of death = 35/51 = 0.69
• A ratio is a comparison of two numbers
– Ratio of dead : alive = 35 : 51
• Odds ratio: commonly used in case-control studies
– Odds in favor of death for females = 12/25 = 0.48
– Odds in favor of death for males = 23/26 = 0.88
– Odds ratio = 0.88/0.48 = 1.84
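The proportion, odds, and odds-ratio calculations above are easy to verify directly; a minimal sketch using the counts quoted on the slides:

```python
# Proportions, odds, and odds ratio for the slide's mortality example.
dead, alive = 35, 51
total = dead + alive            # 86 subjects in all

proportion_dead = dead / total  # numerator is a subset of the denominator
odds_death = dead / alive       # numerator is NOT part of the denominator

# Case-control style odds ratio: odds of death for males vs. females.
odds_females = 12 / 25
odds_males = 23 / 26
odds_ratio = odds_males / odds_females

print(round(proportion_dead, 2))  # 0.41
print(round(odds_death, 2))       # 0.69
print(round(odds_ratio, 2))       # 1.84
```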
Summarizing measurement data
• Distribution patterns
– Symmetrical (bell-shaped) distribution, e.g. the normal distribution
– Skewed distribution
– Bimodal and multimodal distributions
• Indices of central tendency
– Mean, median
• Indices of dispersion
– Variance, standard deviation, coefficient of variation

Indices of central tendency
• (Arithmetic) mean: average of a set of values
X̄ = (Σ Xi) / n, the sum of the n values divided by n
• The mean is sensitive to extreme values
• Example: blood pressure readings

Robust measure of central tendency
• Median: the number separating the higher half of a sample or population from the lower half
• The median is less sensitive to extreme values

Indices of central tendency: quantiles
• Quantiles: dividing the distribution of ordered values into equal-sized parts
– Quartiles: 4 equal parts
– Deciles: 10 equal parts
– Percentiles: 100 equal parts
[Diagram: the ordered values split into four quarters of 25% each, with Q1 = first quartile, Q2 = second quartile = median, Q3 = third quartile]

Indices of dispersion
• Summarize the dispersion of individual values from some central value like the mean
• Give a measure of variation
[Diagram: individual values scattered around the mean]
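As the slides above note, the mean is pulled by extreme values while the median is robust; a quick sketch with made-up blood-pressure readings (the specific numbers are illustrative, not from the slides):

```python
from statistics import mean, median

# Hypothetical systolic blood-pressure readings; 300 is an extreme value.
readings = [120, 125, 130, 135, 300]

print(mean(readings))    # 162, dragged upward by the outlier
print(median(readings))  # 130, unaffected by the outlier
```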
Indices of dispersion: variance
• Variance: average of the squared deviations from the mean
σ² = Σ (Xi − X̄)² / n
• Variance of a sample: usually subtract 1 from n in the denominator
s² = Σ (Xi − X̄)² / (n − 1)
– n − 1 is the effective sample size, also called the degrees of freedom

Indices of dispersion: standard deviation
• Problem with the variance: awkward unit of measurement, as the values are squared
• Solution: take the square root of the variance => standard deviation
• Sample standard deviation (s or sd):
s = √[ Σ (Xi − X̄)² / (n − 1) ]

Standard deviation
• Caution must be exercised when using the standard deviation as a comparative index of dispersion
– Weights of newborn elephants (kg): 929, 853, 878, 939, 895, 972, 937, 841, 801, 826; n = 10, X̄ = 887.1, sd = 56.50
– Weights of newborn mice (kg): 0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89; n = 10, X̄ = 0.68, sd = 0.255
• It is incorrect to say that elephants show greater variation in birth weight than mice simply because of the higher standard deviation

Coefficient of variation
• The coefficient of variation expresses the standard deviation relative to its mean:
cv = s / X̄
– Elephants: n = 10, X̄ = 887.1, s = 56.50, cv = 0.0637
– Mice: n = 10, X̄ = 0.68, s = 0.255, cv = 0.375
• Mice show greater birth-weight variation

When to use the coefficient of variation
• When the comparison groups have very different means (the cv is suitable because it expresses the standard deviation relative to its corresponding mean)
• When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in mg (the cv is suitable for comparison because it is unit-free)
• In such cases, the standard deviation should not be used for comparison

Sample and population
• Populations are rarely studied, because of logistical, financial and other considerations
• Researchers have to rely on study samples
• There are many types of sampling design
• The most common is simple random sampling

Random sampling
• Suppose that we want to estimate the mean birth weight of Malay male live births in Singapore
• Due to logistical constraints, we decide to take a random sample of 100 Malay live births at the National University Hospital (NUH) in a given year
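The elephant-versus-mouse comparison above can be reproduced in a few lines; a minimal sketch using the birth weights from the slide:

```python
from statistics import mean, stdev

# Birth weights from the slide (kg).
elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

def cv(values):
    """Coefficient of variation: sample sd relative to the mean (unit-free)."""
    return stdev(values) / mean(values)

print(round(mean(elephants), 1), round(stdev(elephants), 2))  # 887.1 56.5
print(round(cv(elephants), 4))  # 0.0637
print(round(cv(mice), 2))       # about 0.38: mice have the larger cv
```

Note the slide's 0.375 for mice comes from using the rounded mean 0.68; either way the cv, unlike the sd, correctly ranks mice as more variable.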
Sample, sampled population, and target population
• Target population: all Malay live births in Singapore
• Sampled population: all NUH Malay live births, 1992
• Sample: a random sample of 100 Malay live births at NUH, 1992
– Sample mean X̄ = 3.5 kg, sample sd = 0.25 kg
• Suppose that we know the mean birth weight of the sampled population to be μ = 3.27 kg, with σ = 0.38 kg
• X̄ − μ = 0.23 kg

Sampling error
• Could the difference of 0.23 kg (= 3.5 kg − 3.27 kg) be real, or could it be due purely to chance in sampling?
• An 'apparent' difference between the population mean and the random sample mean that is due purely to chance in sampling is called sampling error
• Sampling error does not mean that a mistake has been made in the process of sampling; it is the variation experienced due to the process of sampling
– Sampling error reflects the difference between the value derived from the sample and the true population value
• The only way to eliminate sampling error is to enumerate the entire population

Estimating sampling error
• Repeated sampling with replacement from the same population (all NUH live births, 1992), keeping the sample size constant (n = 100), yields a different sample mean each time, e.g. X̄ = 3.5, 2.9, 3.8, 3.1, 2.8, 3.5 kg, etc.

Distribution of sample means
• Also known as the sampling distribution of the mean
• Each unit of observation in the sampling distribution is a sample mean
• The spread of the sampling distribution gives a measure of the magnitude of the sampling error
• Central limit theorem: when sample sizes are large, the sampling distribution generated by repeated random sampling with replacement is invariably a normal distribution, regardless of the shape of the population distribution
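The repeated-sampling idea above is easy to simulate; a minimal sketch (the population parameters are the slide's 3.27 kg and 0.38 kg, but drawing the population from a normal distribution is an assumption made only for the simulation):

```python
import random
from statistics import mean, stdev

random.seed(1)
mu, sigma, n = 3.27, 0.38, 100  # sampled-population parameters from the slide

# Draw many samples of size n and record each sample mean.
sample_means = [
    mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(2000)
]

# The spread of the sample means approximates the standard error sigma/sqrt(n).
print(round(mean(sample_means), 2))   # close to 3.27
print(round(stdev(sample_means), 3))  # close to 0.38/10 = 0.038
```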
• Mean of the sampling distribution = population mean = μ
• Standard error of the sample mean: S.E. = σ / √n

Standard deviation vs. standard error
• The standard deviation (s.d.) tells us the variability among individuals
• The standard error (S.E.) tells us the variability of sample means
• Standard error of the mean = S.E. = σ / √n
– σ: standard deviation of the population

Properties of the normal distribution
• Unimodal and symmetrical
• Probability distribution
– The area under the normal curve is 1
• For a normal distribution
– μ ± 1.00σ covers ~68% of the area under the normal curve
– μ ± 1.96σ covers ~95% of the area under the normal curve
– μ ± 2.58σ covers ~99% of the area under the normal curve

Estimation and hypothesis testing
• Statistical estimation
– Estimating population parameters based on sample statistics
– Application of the "confidence interval"
• Hypothesis testing
– Testing certain assumptions about the population by using probabilities to estimate the likelihood of the results obtained in the sample(s), given the assumptions about the population
– Application of the "test for statistical significance"

Statistical estimation
• Two ways to estimate population values from sample values
– Point estimation: using a sample statistic to estimate a population parameter based on a single value
– e.g.
if a random sample of Malay births gave X̄ = 3.5 kg, and we use it to estimate μ, the mean birth weight of all Malay births in the sampled population, we are making a point estimation
– Point estimation ignores sampling error
• Interval estimation: using a sample statistic to estimate a population parameter by making allowance for sampling variation (error)

Interval estimation
• Provides an estimate of the population parameter by defining an interval or range within which the population parameter can be found with a given probability or likelihood
• This interval is called a confidence interval
• In order to understand confidence intervals, we need to return to the discussion of the sampling distribution

Central limit theorem
• Repeated sampling with replacement gives a distribution of sample means that is normally distributed, with a mean that is the true population mean, μ
• Assumptions:
– Large and constant sample size
– Repeated sampling with replacement
– Samples are randomly taken

Sampling distribution of the mean: 95% confidence interval
• The 95% confidence interval gives an interval of values within which there is a 95% chance of locating the true population mean
• There is a 95% chance of finding X̄ within μ ± 1.96 σ/√n; equivalently, the interval X̄ ± 1.96 σ/√n has a 95% chance of covering μ

Estimating standard error
• The sampling distribution is only a theoretical distribution, as in practice we take only one sample, not repeated samples
• Hence S.E. is often not known, but it can be estimated from a single sample:
S.E. = s / √n
– where s is the sample standard deviation and n is the sample size

Precision of statistical estimation
[Diagram: low-precision estimation = wide interval X̄ ± 1.96 S.E.; high-precision estimation = narrow interval X̄ ± 1.96 S.E.]
• The width of a confidence interval gives a measure of the precision of the statistical estimation
• Estimation of a population value can achieve higher precision by minimizing S.E., which depends on the population s.d. and the sample size; that is, S.E. can be minimized by maximizing the sample size (up to a certain point)

Statistical hypothesis testing
• Statistical hypothesis testing is a method of making statistical decisions using experimental data.
• A result is called statistically significant if it is unlikely to have occurred by chance.
• These decisions are almost always made using null-hypothesis tests, that is, ones that answer the question:
– Assuming that the null hypothesis is true, what is the probability of observing a value of the test statistic at least as extreme as the value actually observed?

Hypothesis testing
• Another form of statistical inference, in which we ask questions like
– How likely is the mean systolic blood pressure in a sampled population (e.g. biomedical researchers) to be the same as in the general population? i.e. μ_researchers = μ_general?
– Is the difference between X̄1 and X̄2 statistically significant enough for us to reject the hypothesis that the corresponding μ1 and μ2 are the same? i.e. μ_male = μ_female?
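Returning to interval estimation: a 95% confidence interval can be computed from a single sample as X̄ ± 1.96 s/√n. A minimal sketch using the slide's birth-weight sample (X̄ = 3.5 kg, s = 0.25 kg, n = 100):

```python
from math import sqrt

x_bar, s, n = 3.5, 0.25, 100  # sample mean, sample sd, sample size (slide values)

se = s / sqrt(n)              # estimated standard error of the mean
lower = x_bar - 1.96 * se
upper = x_bar + 1.96 * se

print(round(se, 3))                      # 0.025
print(round(lower, 3), round(upper, 3))  # 3.451 3.549
```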
Steps for a test of significance
• A test of significance can only be done on a difference, e.g. the difference between X̄ and μ, or between X̄1 and X̄2
• 1. Decide the difference to be tested, e.g. the difference in pulse rate between those who were subjected to a stress test and the controls, on the assumption that the stress test significantly increases pulse rate (the hypothesis)
• 2. Formulate a null hypothesis, e.g. no difference in pulse rate between the two groups
– i.e. H0: μ_test = μ_control
– Alternative hypothesis H1: μ_test ≠ μ_control
• 3. Carry out the appropriate test of significance. Based on the test statistic (result), estimate the likelihood that the difference is due purely to sampling error
• 4. On the basis of the likelihood of sampling error, as measured by the P-value, decide whether or not to reject the null hypothesis
• 5. Draw the appropriate conclusion in the context of the biomedical problem, e.g. there is some evidence, from the dataset, that subjects who underwent the stress test have a higher pulse rate than the controls on average

Test of significance: an example
• Suppose a random sample of 100 Malay male live births delivered at NUH gave a sample mean weight of 3.5 kg with an sd of 0.9 kg
• Question of interest: what is the likelihood that the mean birth weight of the sampled population (all Malay male live births delivered at NUH) is the same as the mean birth weight of all Malay male live births in the general population, after taking sampling error into consideration?
• Suppose: X̄ = 3.5 kg, sd = 0.9 kg, μ_pop = 3.0 kg, σ_pop = 1.8 kg
• Difference between means = 3.5 − 3.0 = 0.5 kg
• Null hypothesis, H0: μ_NUH = μ_pop
• The test of significance makes use of the normal-distribution properties of the sampling distribution of the mean

Test of significance
• Where does our sample mean of 3.5 kg lie on the sampling distribution?
• If it lies outside the limits μ ± 1.96 σ/√n, the likelihood that the sample belongs to the same population is equal to or less than 0.05 (5%)

Test of significance: z-test
• The test statistic is the standard normal deviate (SND, or z):
z = (X̄ − μ) / S.E. = (X̄ − μ) / (σ/√n)
• The SND expresses the difference in standard-error units on a standard normal curve with μ = 0 and σ = 1
• For our example, z = (3.5 − 3.0) / (1.8/√100) = 2.78
[Diagram: standard normal curve; 0.475 of the area lies between the mean and each of ±1.96, and our SND of 2.78 lies beyond +1.96]

Test of significance
• There is a less than 5% chance that a difference of this magnitude (0.5 kg) could have occurred on either side of the distribution if the random sample of 100 Malay males had come from a population whose mean birth weight is the same as that of the general population

One-tailed vs. two-tailed tests
• So far, we have been testing for a difference that can occur on both sides of the standard normal distribution
• In our example, the z-value of 2.78 gave a P-value of 0.0054
• For a one-tailed test, a z-score of 2.78 gives a P-value of 0.0027 (= 0.0054/2). A one-tailed test is appropriate only when we are sure, a priori, that the mean birth weight of our sample can only exceed that of the general population
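The z-test above can be computed directly with the standard library; the two-sided normal tail probability comes from the complementary error function:

```python
from math import erfc, sqrt

x_bar, mu, sigma, n = 3.5, 3.0, 1.8, 100  # values from the slide's example

z = (x_bar - mu) / (sigma / sqrt(n))      # standard normal deviate

# Two-tailed P-value: P(|Z| >= z) for a standard normal Z equals erfc(z / sqrt(2)).
p_two_tailed = erfc(z / sqrt(2))

print(round(z, 2))        # 2.78
print(p_two_tailed)       # about 0.005, matching the slide's quoted 0.0054
```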
One-tailed vs. two-tailed tests
• Two-sided tests are conventionally used because, most of the time, we are not sure of the direction of the difference
• One-tailed tests are used only when we can anticipate, a priori, the direction of a difference
• One-tailed tests are tempting because they are more likely to give a significant result
• Given the same z-score, the P-value is halved for a one-tailed test
• It also means that they run a greater risk of rejecting the null hypothesis when it is in fact correct (a type I error)

Type I and type II errors
• If the difference is statistically significant, i.e. H0 is incorrect, failure to reject H0 would lead to a type II error

t-test
• The assumption that the sampling distribution will be normally distributed holds for large samples but not for small samples
• If the sample size is large, use the z-test
• The t-test is used when the sample size is small
– Statistical concept of the t-distribution
– Comparing means for 2 independent groups: unpaired t-test
– Comparing means for 2 matched groups: paired t-test

t-distribution
• Sampling distributions based on small samples will be symmetrical (bell-shaped) but not necessarily normal
• The spread of these symmetrical distributions is determined by the specific sample size.
The smaller the sample size, the wider the spread, and hence the bigger the standard error
• These symmetrical distributions are known as Student's t-distributions or, simply, t-distributions
• The t-distribution approaches the normal distribution as the sample size tends to infinity
[Diagram: family of t-distributions]

t-test for 2 independent samples (unpaired t-test)
• X̄1 = 0.08157, X̄2 = 0.03943
• Question: what is the probability that the difference of 0.04 units between the two sample means has occurred purely by chance, i.e. due to sampling error alone?

Unpaired t-test
• We are testing the hypothesis that battery workers could have higher blood Pb levels than the control group of workers, as they are occupationally exposed
• Note: conventionally, a P-value of 0.05 is generally recognized as low enough to reject the null hypothesis of "no difference"

Unpaired t-test
• Null hypothesis: no difference in mean blood Pb level between battery workers and the control group, i.e.
– H0: μ_battery = μ_control
• The t-score is given by
t = (X̄1 − X̄2) / S.E.(X̄1 − X̄2)
where S.E.(X̄1 − X̄2) = √[ s²_pooled (1/n1 + 1/n2) ] and
s²_pooled = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
with (n1 + n2 − 2) degrees of freedom

Unpaired t-test
• For the given example:
t = (0.08157 − 0.03943) / 0.002868 = 14.7 with 12 d.f.
• P-value < 0.001; reject the null hypothesis
• Some evidence, from the data, that the battery workers in our study have a higher blood Pb level than the control group on average

Unpaired t-test assumptions
• The data are normally distributed in the populations from which the two independent samples have been drawn
• The two samples are random and independent, i.e. observations in one group are not related to observations in the other group
• The 2 independent samples have been drawn from populations with the same (homogeneous) variance, i.e. σ1 = σ2

Example 2: Molecular classification of cancer
• Expression of a gene called ATP6C (vacuolar H+ ATPase proton channel subunit) in 38 bone marrow samples:
– 27 acute lymphoblastic leukemia (ALL): 835 935 1665 764 1323 1030 1482 1306 593 2375 542 809 2474 1514 1977 1235 933 114 -9 3072 608 499 1740 1189 1870 892 677
– 11 acute myeloid leukemia (AML): 3237 1125 4655 3807 3593 2701 1737 3552 3255 4249 1870
• Question: Is ATP6C differentially expressed in the two classes of samples?
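The pooled-variance (unpaired) t-test above can be applied directly to the ATP6C question; a minimal standard-library sketch (the pooled-variance form mirrors the slides; a real analysis might prefer Welch's test, since the equal-variance assumption is doubtful for these data):

```python
from math import sqrt
from statistics import mean, variance

# ATP6C expression values from the slide.
all_ = [835, 935, 1665, 764, 1323, 1030, 1482, 1306, 593, 2375, 542, 809,
        2474, 1514, 1977, 1235, 933, 114, -9, 3072, 608, 499, 1740, 1189,
        1870, 892, 677]
aml = [3237, 1125, 4655, 3807, 3593, 2701, 1737, 3552, 3255, 4249, 1870]

def unpaired_t(x, y):
    """Pooled-variance two-sample t statistic, with n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / se, n1 + n2 - 2

t, df = unpaired_t(aml, all_)
print(df)           # 36
print(round(t, 1))  # a large positive t: AML samples express ATP6C more highly
```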
[Slide: Molecular classification of tumor example]

Paired t-test
• The previous problem uses an unpaired t-test, as the two samples were not matched
– i.e. the two samples were independently derived
• Sometimes, we may need to deal with matched study designs

Paired t-test
• Null hypothesis: no difference in mean cholesterol levels between fasting and postprandial states
– H0: μ_fasting = μ_postprandial
• The t-score is given by
t = d̄ / S.E.(d̄), where S.E.(d̄) = s_d / √n
with (n − 1) degrees of freedom, where n is the number of pairs
• For the example: t = 0.833 / 3.219 = 0.259

Paired t-test
• Action: should not reject the null hypothesis
• Conclusion: insufficient evidence, from the data, to suggest that postprandial cholesterol levels are, on average, higher than fasting cholesterol levels

Common errors relating to the t-test
• Failure to recognize the assumptions
– If an assumption does not hold, explore data transformation or the use of non-parametric methods
• Failure to distinguish between paired and unpaired designs

ANOVA (ANalysis Of VAriance)

Hypothesis test for a population difference in means
• H0: μ1 − μ2 = 0
• Example: difference in average GPAs for male and female students
• The test statistic and confidence interval depend on sample size: large samples (n1 ≥ 30 and n2 ≥ 30) vs. small samples (n1 < 30 or n2 < 30, df = n1 + n2 − 2, using a common standard-deviation estimate)
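The paired t-test described above can be sketched with hypothetical fasting/postprandial pairs (the numbers below are invented for illustration; they are not the slide's data):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical cholesterol readings (mmol/L) for 5 subjects: (fasting, postprandial).
pairs = [(4.6, 4.8), (5.1, 5.0), (4.9, 5.2), (5.3, 5.4), (4.7, 4.7)]

# The paired t-test works on the within-subject differences.
d = [post - fast for fast, post in pairs]

n = len(d)
t = mean(d) / (stdev(d) / sqrt(n))  # t = d-bar / (s_d / sqrt(n)), with n - 1 d.f.

print(round(t, 2))  # 1.41 with 4 degrees of freedom: do not reject H0
```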
ANOVA
• A widely used statistical technique for testing equality of population means when more than two populations are under consideration
• An extension of the two-independent-samples test for a difference in population means, when the population variances are (or can be assumed to be) equal
• Example scenarios
– Variation in time to relief of symptoms between and within 3 treatments
– Music compressed by four MP3 compressors has the same quality
– Three new drugs are all as effective as a placebo
– Lectures, group studying, and computer-assisted instruction are equally effective for undergraduate students

Example: variation in time to relief of symptoms between and within 3 treatments
• n = 15 subjects are randomly selected to participate in the study
• ni = 5 subjects are randomly assigned to each treatment (k = 3), and each subject reports the time to relief of symptoms (in minutes)
• The sample means are numerically different
– (29, 25, 20) => possibly large variability between treatments
• The sample variances are similar, and small
– (0.158, 0.071, 0.158) => observations within each sample are tightly clustered around their respective sample means: small variability within treatments

ANOVA setup (k > 2 populations)
• Identify the groups: group 1, group 2, ..., group k
• Identify the parameters: μi = population mean for group i
• State the appropriate hypotheses
– H0: μ1 = μ2 = ... = μk
– Ha: NOT all means are equal (at least one μ is different)

[Slides: ANOVA table; F-distribution]

Example
• As an example, we analyze the Cushings data set, which is available from the MASS package.
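The between/within decomposition for the 3-treatment example above can be computed from the group summaries alone; a minimal sketch (group means, variances, and ni = 5 are the slide's values):

```python
# One-way ANOVA F statistic from per-group summaries (slide's 3-treatment example).
means = [29, 25, 20]               # group sample means
variances = [0.158, 0.071, 0.158]  # group sample variances
n_i = 5                            # subjects per group
k = len(means)
N = k * n_i

grand_mean = sum(means) / k        # groups are equal-sized, so this is the overall mean

ssb = sum(n_i * (m - grand_mean) ** 2 for m in means)  # between-treatment sum of squares
ssw = sum((n_i - 1) * v for v in variances)            # within-treatment sum of squares

f = (ssb / (k - 1)) / (ssw / (N - k))  # df1 = k - 1 = 2, df2 = N - k = 12

print(round(ssb, 1), round(ssw, 2))  # 203.3 1.55
print(round(f))  # a very large F: strong evidence of between-treatment differences
```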
• The Type variable in the data set shows the underlying type of syndrome, which can be one of four categories:
– adenoma (a)
– bilateral hyperplasia (b)
– carcinoma (c)
– unknown (u)
• The total number of observations is n = 27
• The number of observations in each group is
– n1 = 6, n2 = 10, n3 = 5, and n4 = 6

Example
• We also find the group-specific means, which are
– 3.0, 8.2, 19.7, and 14.0, respectively
• The degrees-of-freedom parameters are
– df1 = 4 − 1 = 3 and df2 = 27 − 4 = 23
• SSB = 893.5 and SSW = 2123.6
• The observed value of the F statistic is f = 3.2, given under the column labeled "F value"
• The resulting p-value is then 0.04

Example
• Therefore, we can reject H0 at the 0.05 significance level (but not at 0.01) and conclude that the differences among group means for the urinary excretion rate of tetrahydrocortisone are statistically significant (at the 0.05 level)

ANALYSIS OF CATEGORICAL DATA

Hypothesis testing involving categorical data
• Chi-square test for statistical association involving 2x2 tables and RxC tables
– Testing for associations involving small, unmatched samples
– Testing for associations involving small, matched samples

Association
• Examining the relationship between 2 categorical variables
• Some examples of association:
– Smoking and lung cancer
– Ethnic group and coronary heart disease
• Questions of interest when testing for association between two categorical variables
– Does the presence/absence of one factor (variable) influence the presence/absence of the other factor (variable)?
• Caution: the presence of an association does not necessarily imply causation

Relating to a comparison between proportions
• Proportion improved in the drug group = 18/24 = 75%
• Proportion improved in the placebo group = 9/20 = 45.0%
• Question: what is the probability that the observed difference of 30% is due purely to sampling error, i.e. to chance in sampling?
• Use the chi-square (χ²) test

Chi-square test for statistical association
• Probability of selecting a person in the drug group = 24/44
• Probability of selecting a person with improvement = 27/44
• Probability of selecting a person from the drug group who had shown improvement = (24/44) × (27/44) = 0.3347 (assuming two independent events)
• Expected value for cell (a) = 0.3347 × 44 = 14.73

Chi-square test for statistical association
• General formula for χ²:
χ² = Σ (obs − exp)² / exp
• The χ² test is always performed on categorical variables using absolute frequencies, never percentages or proportions

Chi-square test for statistical association
• For the given problem:
χ² = (18 − 14.73)²/14.73 + (6 − 9.27)²/9.27 + (9 − 12.27)²/12.27 + (11 − 7.73)²/7.73 = 4.14 with 1 degree of freedom
• The χ² degrees of freedom are given by (no. of rows − 1) × (no. of cols − 1) = (2 − 1) × (2 − 1) = 1
• Observed 2x2 table: drug group 18 improved / 6 not (row total 24); placebo group 9 improved / 11 not (row total 20); column totals 27 and 17; grand total 44
• How many of these 4 cells are free to vary if we keep the row and column totals constant?

Chi-square test for statistical association
• The probability of getting an observed difference of 30% in improvement rates, if the null hypothesis of no association is correct, is between 2% and 5%
• Hence, there is some statistical evidence from this study to suggest that treatment of arthritic patients with the drug can significantly improve grip strength
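The 2x2 chi-square calculation above can be sketched directly; the observed counts are the slide's arthritic-patient table (no continuity correction, matching the slide's arithmetic):

```python
# Chi-square test of association for the slide's 2x2 table.
#                 improved  not improved
observed = [[18, 6],   # drug group    (row total 24)
            [9, 11]]   # placebo group (row total 20)

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand  # expected count under H0
        chi2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)  # 4.14 1 -- P between 0.02 and 0.05, as on the slide
```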
Limsoon Wong, 74 Chi-square test for statistical association Chi-square test for statistical association l For the given hi-square test forproblem: statistical association 74 • For the given problem: 2 For the given problem: (obs exp) (18 14.73) 2 (6 9.27) 2 (9 12.27) 2 (11 7.73) 2 2 12 Chi-square for association (obs exp)exp (18 14test .7314 ) 2.73 (6 9statistical .27)92 .27(9 12.27 ) 2 .27(11 7.73)72.73 4.14 with 1 degree exp 14.73of freedom 9.27 12.27 7.73 given problem: 4•.14For withthe 1 degree of freedom • 2 degree of2 freedom is2 given by: 18) 2 6(11 24 (obs exp) (18 14.73) (6 9.27) 2 (9 12.27 7.73) 2 2 degree ofexpfreedom is.73 given by: (no. of rows-1)*(no. ofof cols-1) l Chi-square degree 14 9freedom .27 12is .276given 720 .73 (no. 9 24 11 by: 18 (no. rows-1)*(no. of cols-1) cols-1) = (2-1)*(2-1) of4of rows-1)*(no. of =171 44 (2-1)*(2-1) = 1freedom 27 20 .= 14 with 1 degree of 9 11 = (2-1)*(2-1) = 1 17 44 How many of27these • 2 degree of freedom is given by: 4 cells free to 18 6 How many of are these vary if wetokeep the 9 (no. of rows-1)*(no.4 of 11 cellscols-1) are free row and column vary if we keep the = (2-1)*(2-1) = 1 27 17 totals constant? 24 20 44 row and column totals constant? How many of these Copyright 4 cells are free to 2013 © Limsoon Wong, vary if we keep the 75 Chi-square test for statistical association l l Probability of getting an observed difference of 30% in improvement rates if the Null hypothesis of no association is correct is between 2% and 5% Hence, there is some statistical evidence from this study to suggest that treatment of arthritic patient with the drug can significantly improve grip strength