Introduction to Biostatistics for Clinical and Translational Researchers
KUMC Departments of Biostatistics & Internal Medicine
University of Kansas Cancer Center
FRONTIERS: The Heartland Institute of Clinical and Translational Research

Course Information
Jo A. Wick, PhD
Office Location: 5028 Robinson
Email: jwick@kumc.edu
Lectures are recorded and posted at http://biostatistics.kumc.edu under 'Events and Opportunities'

Inferences: Hypothesis Testing

Experiment
An experiment is a process whose results are not known until after it has been performed. The range of possible outcomes is known in advance. We do not know the exact outcome, but would like to know the chances of its occurrence. The probability of an outcome E, denoted P(E), is a numerical measure of the chances of E occurring: 0 ≤ P(E) ≤ 1.

Probability
The most common definition of probability is the relative frequency view:
P(x = a) = (# of times x = a) / (total # of observations of x)
Probabilities for the outcomes of a random variable x are represented through a probability distribution. (A short code sketch of this relative-frequency idea appears at the end of this section.)
[Figure: probability distribution of hospital length of stay (LOS); the height of the bar at x = 6 days gives P(LOS = 6 days).]

Population Parameters
Most often our research questions involve unknown population parameters: What is the average BMI among 5th graders? What proportion of hospital patients acquire a hospital-based infection? To determine these values exactly would require a census. However, due to a prohibitively large population (or other considerations), a sample is taken instead.

Sample Statistics
Statistics describe or summarize sample observations. They vary from sample to sample, making them random variables. We use statistics generated from samples to make inferences about the parameters that describe populations.

Sampling Variability
[Figure: repeated samples drawn from a population with parameters μ and σ yield different sample statistics (e.g., x̄ = 0, s = 1; x̄ = 0.15, s = 1.1; x̄ = 0.1, s = 0.98); collectively these statistics form the sampling distribution of x̄.]

Recall: Hypotheses
Null hypothesis "H0": statement of no differences or association between variables. This is the hypothesis we test—the first step in the 'recipe' for hypothesis testing is to assume H0 is true.
Alternative hypothesis "H1": statement of differences or association between variables. This is what we are (usually) trying to prove.

Hypothesis Testing
One-tailed hypothesis: the outcome is expected in a single direction (e.g., administration of the experimental drug will result in a decrease in systolic BP); H1 includes '<' or '>'.
Two-tailed hypothesis: the direction of the effect is unknown (e.g., the experimental therapy will result in a different response rate than that of the current standard of care); H1 includes '≠'.

Hypothesis Testing
The statistical hypotheses are statements concerning characteristics of the population(s) of interest:
Population mean: μ
Population variability: σ
Population rate (or proportion): π
Population correlation: ρ
Example: It is hypothesized that the response rate for the experimental therapy is greater than that of the current standard of care. πExp > πSOC ← This is H1.
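To make the relative-frequency view of probability concrete, here is a minimal sketch in Python (the course itself does not prescribe software). The length-of-stay values are hypothetical, generated only to illustrate the computation of P(x = a) as a relative frequency.

```python
# A minimal sketch of the relative-frequency view of probability.
# The length-of-stay (LOS) data are hypothetical, invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
los = rng.poisson(lam=6, size=1000)  # hypothetical LOS values, in days

# P(LOS = 6) estimated as (# of times x = 6) / (total # of observations)
p_six = np.mean(los == 6)
print(f"Estimated P(LOS = 6 days) = {p_six:.3f}")

# The full empirical probability distribution of LOS
values, counts = np.unique(los, return_counts=True)
for v, c in zip(values, counts):
    print(f"P(LOS = {v:2d}) = {c / los.size:.3f}")
```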
Recall: Decisions
Type I Error (α): a true H0 is incorrectly rejected. "An innocent man is proven GUILTY in a court of law." Commonly accepted rate is α = 0.05.
Type II Error (β): failing to reject a false H0. "A guilty man is proven NOT GUILTY in a court of law." Commonly accepted rate is β = 0.2.
Power (1 – β): correctly rejecting a false H0. "Justice has been served." Commonly accepted rate is 1 – β = 0.8.

Decisions
                     Truth: H1          Truth: H0
Conclusion: H1       Correct (Power)    Type I Error
Conclusion: H0       Type II Error      Correct

Basic Recipe for Hypothesis Testing
1. State H0 and H1.
2. Assume H0 is true ← Fundamental assumption!!
3. Collect the evidence—from the sample data, compute the appropriate sample statistic and the test statistic. Test statistics (e.g., t, chi-square, F) quantify the level of evidence within the sample—they also provide us with the information for computing a p-value.
4. Determine if the test statistic is large enough to meet the a priori determined level of evidence necessary to reject H0 (. . . or, is p < α?).

Example: Carbon Monoxide
An experiment is undertaken to determine the concentration of carbon monoxide in air. It is a concern that the actual concentration is significantly greater than 10 mg/m3. Eighteen air samples are obtained and the concentration for each sample is measured. The outcome x is the carbon monoxide concentration in the samples. The characteristic (parameter) of interest is μ—the true average concentration of carbon monoxide in air.

Step 1: State H0 & H1
H0: μ ≤ 10 mg/m3 ← We assume in order to test!
H1: μ > 10 mg/m3 ← We suspect!

Step 2: Assume μ = 10
[Figure: sampling distribution centered at μ = 10.]

Step 3: Evidence
10.25 10.37 10.66 10.47 10.56 10.22 10.44 10.38 10.63 10.40 10.39 10.26 10.32 10.35 10.54 10.33 10.48 10.68
Sample statistic: x̄ = 10.43
Test statistic: t = (x̄ − μ0)/(s/√n) = (10.43 − 10)/(1.02/√18) = 1.79
What does 1.79 mean? How do we use it?

Student's t Distribution
Remember when we assumed H0 was true (Step 2: Assume μ = 10)? What we were actually doing was setting up the theoretical Student's t distribution from which the p-value can be calculated: t = (x̄ − μ0)/(s/√n) = (10.43 − 10)/(1.02/√18). [Figure: Student's t distribution centered at t = 0.]

Student's t Distribution
Assuming the true air concentration of carbon monoxide is actually 10 mg/m3, how likely is it that we should get evidence in the form of a sample mean equal to 10.43? P(x̄ ≥ 10.43) = ? [Figure: distribution centered at μ = 10 with x̄ = 10.43 marked in the right tail.]

Student's t Distribution
We can say how likely by framing the statement in terms of the probability of an outcome: with t = (x̄ − μ0)/(s/√n), p = P(t ≥ 1.79) = 0.0456. [Figure: t distribution with t = 1.79 marked in the right tail.]

Step 4: Make a Decision
Decision rule: if p ≤ α, the chances of getting the actual collected evidence from our sample given the null hypothesis is true are very small. The observed data conflict with the null 'theory' and support the alternative 'theory.' Since the evidence (data) was actually observed and our theory (H0) is unobservable, we choose to believe that our evidence is the more accurate portrayal of reality and reject H0 in favor of H1.

Step 4: Make a Decision
What if our evidence had not been in as great a degree of conflict with our theory? If p > α, the chances of getting the actual collected evidence from our sample given the null hypothesis is true are pretty high, and we fail to reject H0. [Figure: distribution centered at μ = 10 with x̄ = 10.1 near the center; P(x̄ ≥ 10.1) is large.]

Decision
How do we know if the decision we made was the correct one? We don't! If α = 0.05, the chances of our decision being an incorrect rejection of a true H0 are no greater than 5%. We have no way of knowing whether we made this kind of error—we only know that our chances of making it in this setting are relatively small.
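As a sketch of Steps 3 and 4, the calculation above can be reproduced from the summary statistics reported on the slide (x̄ = 10.43, s = 1.02, n = 18). Python with SciPy is assumed here, not prescribed by the course.

```python
# A minimal sketch of the one-sample t test for the carbon monoxide example,
# computed from the reported summary statistics; assumes SciPy is available.
from scipy import stats
import math

xbar, mu0, s, n = 10.43, 10.0, 1.02, 18

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # test statistic
p_value = stats.t.sf(t_stat, df=n - 1)       # one-sided: P(t >= t_stat)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # t = 1.79, p = 0.0456
alpha = 0.05
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```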
Which test do I use?
What kind of outcome do you have? Nominal? Ordinal? Interval? Ratio? How many samples do you have? Are they related or independent?

Types of Tests: One Sample
Nominal — population parameter: proportion π; hypotheses H0: π = π0 vs. H1: π ≠ π0; sample statistic p = x/n; method: binomial test, or z test if np > 10 and nq > 10.
Ordinal — population parameter: median M; hypotheses H0: M = M0 vs. H1: M ≠ M0; sample statistic m = p50; method: Wilcoxon signed-rank test.
Interval — population parameter: mean μ; hypotheses H0: μ = μ0 vs. H1: μ ≠ μ0; sample statistic x̄; method: Student's t, or Wilcoxon if non-normal or small n.
Ratio — same as interval: Student's t, or Wilcoxon if non-normal or small n.

Types of Tests
Parametric methods make assumptions about the distribution of the data (e.g., normally distributed) and are suited for sample sizes large enough to assess whether the distributional assumption is met.
Nonparametric methods make no assumptions about the distribution of the data and are suitable for small sample sizes or large samples where parametric assumptions are violated. They use ranks of the data values rather than the actual data values themselves, at the cost of a loss of power when the parametric test is appropriate.

Types of Tests: Two Independent Samples
Nominal — parameters π1, π2; hypotheses H0: π1 = π2 vs. H1: π1 ≠ π2; statistics p1 = x1/n1, p2 = x2/n2; method: Fisher's exact, or chi-square if cell counts > 5.
Ordinal — parameters M1, M2; hypotheses H0: M1 = M2 vs. H1: M1 ≠ M2; statistics m1, m2; method: median test.
Interval or ratio — parameters μ1, μ2; hypotheses H0: μ1 = μ2 vs. H1: μ1 ≠ μ2; statistics x̄1, x̄2; method: Student's t, or Mann-Whitney if non-normal, unequal variances, or small n.

Comparing Central Tendency
2 groups, normal or large n: independent samples → 2-sample t; dependent samples → paired t.
2 groups, non-normal or small n: independent samples → Wilcoxon rank-sum; dependent samples → Wilcoxon signed-rank.
>2 groups, normal or large n: independent samples → ANOVA; dependent samples → 2-way ANOVA.
>2 groups, non-normal or small n: independent samples → Kruskal-Wallis; dependent samples → Friedman's.

Two-Sample Test of Means
Clotting times (minutes) of blood for subjects given one of two different drugs:
Drug B: 8.8, 8.4, 7.9, 8.7, 9.1, 9.6 (x̄1 = 8.75)
Drug G: 9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5 (x̄2 = 9.74)
It is hypothesized that the two drugs will result in different blood-clotting times.
H0: μB = μG
H1: μB ≠ μG

Two-Sample Test of Means
What we're actually hypothesizing: H0: μB − μG = 0.
Evidence! x̄1 − x̄2 = −0.99
P(x̄1 − x̄2 ≤ −0.99) = ? P(x̄1 − x̄2 ≥ 0.99) = ?

Two-Sample Test of Means
t = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2) = (8.75 − 9.74)/0.40 = −2.475
p = P(|t| > 2.475) = 0.03
[Figure: t distribution with t = −2.48 and t = +2.48 marked in the two tails.]
***Two-sided tests detect ANY evidence in EITHER direction that the null difference is unlikely!
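Here is a sketch of the same two-sample test in Python/SciPy, using the drug B and drug G columns as reconstructed in the table above. The default pooled-variance form of scipy.stats.ttest_ind matches the reported t = −2.475; Welch's version (equal_var=False) would relax the equal-variance assumption.

```python
# A minimal sketch of the two-sample t test on the clotting-time data;
# assumes SciPy is available.
from scipy import stats

drug_b = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]        # x-bar = 8.75
drug_g = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4, 9.5]  # x-bar = 9.74

# Pooled-variance (classic Student's) t test, matching the reported t.
t_stat, p_value = stats.ttest_ind(drug_b, drug_g, equal_var=True)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")  # t = -2.48, p = 0.03
```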
Assumptions of t
In order to use the parametric Student's t test, we have a few assumptions that need to be met: approximate normality of the observations and, in the case of two samples, approximate equality of the sample variances.

Assumption Checking
To assess the assumption of normality, a simple histogram will show any issues with skewness or outliers. [Figure: histograms illustrating skewness.] Other graphical assessments include the Q-Q plot. [Figure: Q-Q plots, including one showing a violation of normality.]

Assumption Checking
To assess the assumption of equal variances (when groups = 2), simple boxplots will show any issues with heteroscedasticity. [Figure: side-by-side boxplots.] Rule of thumb: if the larger variance is more than 2 times the smaller, the assumption has been violated.

Now what?
If you have enough observations (20? 30?) to be able to determine whether the assumptions are feasible, check them. If violated:
• Try a transformation to correct the violated assumptions (e.g., natural log) and reassess; proceed with the t-test if fixed.
• If a transformation doesn't work, proceed with a nonparametric test.
• Or skip the transformation altogether and proceed to the nonparametric test.
If the assumptions are okay, proceed with the t-test.

Now what?
If you have too small a sample to adequately assess the assumptions, perform the nonparametric test instead. For the one-sample t, we typically substitute the Wilcoxon signed-rank test; for the two-sample t, we typically substitute the Mann-Whitney test.

Consequences of Nonparametric Testing
Robust! But less powerful, because they are based on ranks, which do not contain the full level of information contained in the raw data. When in doubt, use the nonparametric test—it will be less likely to give you a 'false positive' result.

Speaking of Power
"How many subjects do we need?" Statistical methods can be used to determine the required number of patients to meet the trial's principal scientific objectives. Other considerations that must be accounted for include the availability of patients and resources and the ethical need to prevent any patient from receiving inferior treatment. We want the minimum number of patients required to achieve our principal scientific objective.

The Size of a Clinical Trial
For the chosen level of significance (type I error rate, α), a clinically meaningful difference (Δ) between two groups can be detected with a minimally acceptable power (1 – β) with n subjects.

Example: Detecting a Difference
Primary objective: To compare pain improvement in knee OA for new treatment A compared to standard treatment S.
Primary outcome: Change in pain score from baseline to 24 weeks (continuous).
Data analysis: Comparison of the mean change in pain score of patients on treatment A (μ1) versus standard (μ2) using a two-sided t-test at the α = 0.05 level of significance.

Example: Detecting a Difference
Difference to detect (Δ): It has been determined that a difference of 10 on this pain scale is clinically meaningful. If standard therapy results in a 5-point decrease, our new therapy would need to show a decrease of at least 15 (5 + 10) to be declared clinically different from the standard. We would like to be 80% sure that we detect this difference as statistically significant.

Example: Detecting a Difference
What usually occurs on the standard? This is important information because it tells us about the behavior of the outcome (pain scale) in these patients. If the pain scale has great variability, it may be difficult to detect small to moderate changes (signal-to-noise)!
[Figure: two panels of change in pain from baseline for treatments S and A; both panels show a difference of 20, but the panel with less within-group variability (higher signal-to-noise) makes the difference easier to detect.]

Example: Detecting a Difference
We have: H0: μ1 = μ2 (Δ = 0) versus H1: μ1 ≠ μ2 (Δ ≠ 0); α = 0.05; 1 – β = 0.80; Δ = 10. For continuous outcomes we need to determine what difference would be clinically meaningful, but specified in the form of an effect size, which takes into account the variability of the data.

Example: Detecting a Difference
Effect size is the difference in the means divided by the standard deviation, usually of the control or comparison group, or the pooled standard deviation of the two groups: d = (μ1 − μ2)/σ, where σ is estimated from s1², s2², n1, and n2 when pooling.

Power Calculations
Interactive web-based tools can show the relationship between power and the sample size, variability, and difference to detect:
A decrease in the variability of the data results in an increase in power for a given sample size.
An increase in the effect size results in a decrease in the required sample size to achieve a given power.
Increasing α results in a decrease in the required sample size to achieve a given power.
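A sketch of the sample-size calculation for the knee-OA example follows, assuming Python with statsmodels. The standard deviation (σ = 20) is a hypothetical value chosen only to illustrate the effect-size step; the slides specify Δ = 10 but not σ.

```python
# A minimal sketch of a two-sample power/sample-size calculation;
# assumes statsmodels is installed. sigma = 20 is hypothetical.
from statsmodels.stats.power import TTestIndPower

delta = 10         # clinically meaningful difference (from the slides)
sigma = 20         # hypothetical SD of the pain-score change
d = delta / sigma  # effect size, d = (mu1 - mu2) / sigma

n_per_group = TTestIndPower().solve_power(
    effect_size=d, alpha=0.05, power=0.80, alternative='two-sided'
)
print(f"Effect size d = {d}, n per group = {n_per_group:.1f}")  # about 64
```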
Inferences on Two Means
Example: Smoking cessation. Two types of therapy: x = {behavioral therapy, literature}. Dependent variable: y = % decrease in number of cigarettes smoked per day after six months of therapy.
Behavioral Therapy: 10, 20, 65, 0, 30
Literature Only: 6, 2, 0, 12, 4

Smoking Cessation
Research question: Is behavioral therapy in addition to education better than education alone in getting smokers to quit?
H0: μ1 = μ2 versus H1: μ1 ≠ μ2
A two-independent-samples t-test applies IF: the change is approximately normal OR can be transformed to an approximate normal distribution (e.g., natural log), and the variability within each group is approximately the same (rule of thumb: no more than a 2x difference).

Smoking Cessation
Reject H0: μ1 = μ2. Conclusion: Adding behavioral therapy to cessation education results in—on average—a greater reduction in cigarettes smoked per day at six months post-therapy when compared to education alone (t(30.9) = 2.87, p < 0.01).

Smoking Cessation
The 95% confidence interval is: 1.42 ≤ μ1 − μ2 ≤ 8.39.
Interpretation: On average, behavioral therapy resulted in an additional reduction of 4.9% (95% CI: 1.42%, 8.39%) relative to control.

Confidence Intervals
What exactly do confidence intervals represent? Remember that theoretical sampling distribution concept? It doesn't actually exist; it's only mathematical. What would we see if we took sample after sample after sample and did the same test on each?

Confidence Intervals
Suppose we actually took sample after sample . . . 100 of them, to be exact. Every time we take a different sample and compute the confidence interval, we will likely get a slightly different result simply due to sampling variability.

Confidence Intervals
95% confident means: "In 95 of the 100 samples, our interval will contain the true unknown value of the parameter. However, in 5 of the 100 it will not."

Confidence Intervals
Our "confidence" is in the procedure that produces the interval—i.e., it performs well most of the time. Our "confidence" is not directly related to our particular interval—we cannot say "The probability that the mean difference is between (1.4, 8.4) is 0.95."
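The "100 samples" thought experiment above can be simulated directly. A minimal sketch, assuming Python with NumPy and SciPy: draw 100 samples from a population with a known mean, build a 95% t-based interval from each, and count how many intervals cover the truth.

```python
# A minimal sketch of confidence-interval coverage; the population values
# (mu = 50, sigma = 10, n = 25) are hypothetical, chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mu, sigma, n = 50.0, 10.0, 25

covered = 0
for _ in range(100):
    sample = rng.normal(true_mu, sigma, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)        # 97.5th percentile of t
    lo, hi = xbar - t_crit * se, xbar + t_crit * se
    covered += (lo <= true_mu <= hi)

print(f"{covered} of 100 intervals covered the true mean")  # typically ~95
```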
Inferences on More Than Two Means
Example: Smoking cessation. Three types of therapy: x = {pharmaceutical therapy, behavioral therapy, literature}. Dependent variable: y = % decrease in number of cigarettes smoked per day after six months of therapy.
Pharmaceutical Therapy: 10, 30, 60, 32, 65
Behavioral Therapy: 10, 0, 6, 0, 30
Literature Only: 6, 20, 0, 12, 4

Smoking Cessation
Research question: Is therapy in addition to education better than education alone in getting smokers to quit? If so, is one therapy more effective?
H0: μ1 = μ2 = μ3 versus H1: At least one μ is different.
More than 2 independent samples requires an ANOVA, provided: the change is approximately normal OR can be transformed to an approximate normal distribution (e.g., natural log), and the variability within each group is approximately the same (rule of thumb: no more than a 2x difference).

Smoking Cessation
ANOVA produces a table. [Figure: ANOVA table output.] One-way ANOVA indicates you have a single categorical factor x (e.g., treatment) and a single continuous response y, and your interest is in comparing the mean response μ across the levels of the categorical factor.

Wait . . .
Why is ANOVA using variances when we're hypothesizing about means? The between-groups mean square is a variance; the within-groups mean square is also a variance; and F is a ratio of variances: F = MSBG/MSWG.

What's the Rationale?
In the simplest case of the one-way ANOVA, the variation in the response y is broken down into parts: variation in response attributed to the treatment (group/sample) and variation in response attributed to error (subject characteristics + everything else not controlled for). The variation in the treatment (group/sample) means is compared to the variation within a treatment (group/sample) using a ratio—this is the F test statistic! If the between-treatment variation is a lot bigger than the within-treatment variation, that suggests there are some different effects among the treatments.

Rationale
[Figure: boxplots of the response under three scenarios, labeled 1, 2, and 3.]
There is an obvious difference between scenarios 1 and 2. What is it? Just looking at the boxplots, which of the two scenarios (1 or 2) do you think would provide more evidence that at least one of the populations is different from the others? Why?

F Statistic
F = (variation between the sample means) / (natural variation within the samples)
Case A: If all the sample means were exactly the same, what would be the value of the numerator of the F statistic?
Case B: If all the sample means were spread out and very different, how would the variation between sample means compare to the value in A?

F Statistic
So what values could the F statistic take on? Could you get an F that is negative? What type of values of F would lead you to believe the null hypothesis—that there is no difference in group means—is not accurate?

Smoking Cessation
ANOVA produces a table. Conclusion: Reject H0: μ1 = μ2 = μ3. Some difference in the number of cigarettes smoked per day exists between subjects receiving the three types of therapy.

Smoking Cessation
But where is the difference? Are the two experimental therapies different? Or is it that each is different from the control?

Smoking Cessation
Reject H0: μ1 = μ3 and H0: μ1 = μ2. Both pharmaceutical and behavioral therapy are significantly different from the literature-only control group, but the two therapies are not different from each other. (A code sketch of this kind of analysis follows.)
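Here is a sketch of a one-way ANOVA with Tukey post-hoc comparisons in Python (SciPy and statsmodels assumed), using the five observations per group from the toy table above. The statistics reported on the slides evidently come from the full study data, so this sketch illustrates the mechanics rather than reproducing those numbers.

```python
# A minimal sketch of one-way ANOVA plus Tukey HSD post-hoc comparisons;
# assumes SciPy and statsmodels. Group values come from the toy table above.
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

pharm = [10, 30, 60, 32, 65]
behav = [10, 0, 6, 0, 30]
lit   = [6, 20, 0, 12, 4]

# Overall F test of H0: mu1 = mu2 = mu3
f_stat, p_value = stats.f_oneway(pharm, behav, lit)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# If H0 is rejected, pairwise comparisons locate the difference(s)
y = pharm + behav + lit
groups = ["pharm"] * 5 + ["behav"] * 5 + ["lit"] * 5
print(pairwise_tukeyhsd(y, groups, alpha=0.05))
```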
Smoking Cessation
Conclusion: Adding either behavioral (p = 0.015) or pharmaceutical therapy (p < 0.01) to cessation education results in—on average—significantly greater decreases in cigarettes smoked per day at six months post-therapy when compared to education alone.

Inferences on Means
Concerns a continuous response y. One or two groups: t. More than two groups: ANOVA. Remember, this (and the two-sample case) is essentially looking at the association between an x and a y, where x is categorical (nominal or ordinal) and y is continuous (interval or ratio). Check assumptions! Normality of y; equal group variances.

ANOVA Models
There are many . . .
Randomized designs with one treatment:
A. Subjects not subdivided on any basis other than randomization prior to assignment to treatment levels; no restriction on random assignment other than the option of assigning the same number of subjects to each treatment level:
1. Completely randomized or one-factor design
B. Subjects subdivided on some nonrandom basis, or one or more restrictions on random assignment other than assigning the same number of subjects to each treatment level:
1. Balanced incomplete block design
2. Crossover design
3. Generalized randomized block design
4. Graeco-Latin square design
5. Hyper-Graeco-Latin square design
6. Latin square design
7. Partially balanced incomplete block design
8. Randomized block design
9. Youden square design
Randomized designs with two or more treatments:
A. Factorial experiments: designs in which all treatment levels are crossed:
1. Designs without confounding
a. Completely randomized factorial design
b. Generalized randomized factorial design
c. Randomized block factorial design
2. Designs with group-treatment confounding
a. Split-plot factorial design
3. Designs with group-interaction confounding
a. Latin square confounded factorial design
b. Randomized block completely confounded factorial design
c. Randomized block partially confounded factorial design
4. Designs with treatment-interaction confounding
a. Completely randomized fractional factorial design

Inferences on Proportions (k = 2)
Example: plant genetics. Two phenotypes: x = {yellow-flowered plants, green-flowered plants}. Dependent variable: y = proportion of plants out of 100 progeny that express each phenotype, y = x/n. (Raw data: Yellow, Yellow, Green, Yellow, Green, . . .)

Plant Genetics
The plant geneticist hypothesizes that his crossed progeny will result in a 3:1 phenotypic ratio of yellow-flowered to green-flowered plants.
H0: The population contains 75% yellow-flowered plants versus H1: The population does not contain 75% yellow-flowered plants. That is, H0: πy = 0.75 versus H1: πy ≠ 0.75.
This particular type of test is referred to as the chi-square goodness-of-fit test for k = 2.

Plant Genetics
Chi-square statistics compute deviations between what is expected (under H0) and what is actually observed in the data:
χ² = Σ (O − E)²/E, with DF = k – 1, where k is the number of categories of x.

Plant Genetics
Suppose the researcher actually observed in his sample of 100 plants this breakdown of phenotype:
Yellow-flowered: f = 84 (84%); Green-flowered: f = 16 (16%).
Does it appear that this type of sample could have come from a population where the true proportion of yellow-flowered plants is 75%?
χ²₁ = (84 − 75)²/75 + (16 − 25)²/25 = 4.32
Conclusion: Reject H0: πy = 0.75—it does not appear that the geneticist's hypothesis about the population phenotypic ratio is correct (p = 0.038).
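A sketch of this goodness-of-fit test in Python/SciPy follows; the observed and expected counts are exactly those from the slide.

```python
# A minimal sketch of the chi-square goodness-of-fit test for the 3:1
# phenotypic ratio; assumes SciPy is available.
from scipy import stats

observed = [84, 16]   # yellow, green (out of n = 100)
expected = [75, 25]   # under H0: pi_yellow = 0.75

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, df = 1, p = {p_value:.3f}")  # 4.32, p = 0.038
```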
Inferences on Proportions (k > 2)
Example: plant genetics. Four phenotypes: x = {yellow-smooth flowered, yellow-wrinkled flowered, green-smooth flowered, green-wrinkled flowered}. Dependent variable: y = proportion of plants out of 250 progeny that express each phenotype, y = x/n. (Raw data: Yellow smooth, Yellow smooth, Green wrinkled, Yellow wrinkled, . . .)

Plant Genetics
The plant geneticist hypothesizes that his crossed progeny will result in a 9:3:3:1 phenotypic ratio of YS:YW:GS:GW plants. The actual numeric hypothesis is H0: π1 = 0.5625, π2 = 0.1875, π3 = 0.1875, π4 = 0.0625. This particular type of test is referred to as the chi-square goodness-of-fit test for k = 4.

Plant Genetics
As before, the chi-square statistic computes deviations between what is expected (under H0) and what is actually observed in the data: χ² = Σ (O − E)²/E, with DF = k – 1, where k is the number of categories of x.

Plant Genetics
Suppose the researcher actually observed in his sample of 250 plants this breakdown of phenotype:
YS: 152 (60.8%); YW: 39 (15.6%); GS: 53 (21.2%); GW: 6 (2.4%).
Does it appear that this type of sample could have come from a population where the true phenotypic ratio is as the geneticist hypothesized?
χ²₃ = 8.972
Conclusion: Reject H0—it does not appear that the geneticist's hypothesis about the population phenotypic ratio is correct (p = 0.03).

Inferences on Proportions
Concerns a categorical response y. Regardless of the number of groups, a chi-square test may be used. Remember, this is essentially looking at the association between an x and a y, where both x and y are categorical (nominal or ordinal). Assumptions? Rule of thumb: no expected frequency should be less than 5; if some expected count nπ < 5, use the binomial (k = 2) or multinomial (k > 2) test instead.

Inferences on Proportions
What do we do when we have nominal data on more than one factor x? Gender and hair color; menopausal status and disease stage at diagnosis; 'handedness' and gender. We still use chi-square! These types of tests are looking at whether two categorical variables are independent of one another—thus, tests of this type are often referred to as chi-square tests of independence.

Inferences on Proportions
Example: Hair color and Gender. Gender: x1 = {M, F}; Hair color: x2 = {Black, Brown, Blonde, Red}.

          Black        Brown        Blonde      Red       Total
Male      32 (32%)     43 (43%)     16 (16%)    9 (9%)    100
Female    55 (27.5%)   65 (32.5%)   64 (32%)    16 (8%)   200
Total     87           108          80          25        N = 300

What the data should look like in the actual dataset (one row per subject):
Gender / Hair Color: Male / Black; Female / Red; Female / Blonde; . . .

Hair Color and Gender
The researcher hypothesizes that hair color is not independent of sex.
H0: Hair color is independent of gender (i.e., the phenotypic ratio is the same within each gender).
H1: Hair color is not independent of gender (i.e., the phenotypic ratio is different between genders).

Hair Color and Gender
Chi-square statistics compute deviations between what is expected (under H0) and what is actually observed in the data: χ² = Σ (O − E)²/E, with DF = (r – 1)(c – 1), where r is the number of rows and c is the number of columns.

Hair Color and Gender
Does it appear that this type of sample could have come from a population where the different hair colors occur with the same frequency within each gender? OR does it appear that the distribution of hair color is different between men and women?
χ²₃ = 8.99 > 7.815 (the critical value for α = 0.05)
Conclusion: Reject H0: gender and hair color are independent. It appears that the researcher's hypothesis that the population phenotypic ratio is different between genders is correct (p = 0.029).
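A sketch of the chi-square test of independence on the hair color by gender table, assuming Python with SciPy:

```python
# A minimal sketch of the chi-square test of independence; assumes SciPy.
from scipy import stats

#            Black  Brown  Blonde  Red
table = [[32, 43, 16, 9],     # males   (n = 100)
         [55, 65, 64, 16]]    # females (n = 200)

chi2, p_value, df, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.3f}")
# chi-square = 8.99, df = 3, p = 0.029
```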
Inferences on Proportions
Special case: when you have a 2×2 contingency table, you are actually testing a hypothesis concerning two population proportions: H0: π1 = π2 (i.e., the proportion of males who are blonde is the same as the proportion of females who are blonde).

          Blonde       Non-blonde     Total
Male      16 (16%)     84 (84%)       100
Female    64 (32%)     136 (68%)      200
Total     80 (26.7%)   220 (73.3%)    N = 300

Inferences on Proportions
When you have a single proportion and a small sample, substitute the binomial test, which provides exact results. The nonparametric Fisher exact test can always be used in place of the chi-square test when you have contingency-table-like data (i.e., two categorical factors whose association is of interest)—it should be substituted for the chi-square test of independence when 'cell' sizes are small.

Next Time
Linear Regression and Correlation
Survival Analysis
Final Thoughts