BioStatistics

Why Statistics?
  You want to draw the strongest possible conclusions from limited data
  Differences in biological systems sometimes cannot be easily observed
    Random variation?
    Real difference?
  Statistics are sometimes unnecessary
    Large differences in observed events
    And small scatter within groups
  In most instances, though, statistics can provide you with mathematically based conclusions
    Clinical research
    Field research

Statistics extrapolate from sample to population
  The only way to draw absolute conclusions about a population is to measure the trait(s) of interest in every individual in that population
  In reality, this is almost always impossible
  Thus, randomly sampling some of the individuals can provide information about the entire population
  Random sampling can sometimes be difficult to define
  If your sample is not random, then conclusions about the population may not be valid

Samples and Populations: Quality control
  A company manufactures 20,000 vials (population) of a vaccine in a single production run
  About 50 vials (sample) are taken from this production run and analyzed for a variety of characteristics
  The results from the 50 vials are then extrapolated to the remaining vials

Samples and Populations: Political polls
  The number of eligible U.S. voters is about 125,000,000 (population)
  A few hundred or a few thousand (sample) are asked to respond to political questions

Samples and Populations: Clinical studies
  Patients in a clinical study (sample) have a clinical condition (e.g., a disease)
  They rarely reflect the entire population
  However, they often reflect the population with the condition
  Sampling humans can be particularly difficult

Samples and Populations: Field experiments
  Local variations
  Impact of weather
  Environmental conditions/changes
  Human impact
  Sampling bias

Samples and Populations: Laboratory experiments
  Usually not necessary
  Highly controlled experiments
  Single variable
  Genetically defined organisms
  Very little variation

What statistical calculations can do
  Statistical estimation
    The mean calculated from a sample is a precise number
    However, that number is only an estimate of the mean of the whole population
  Statistical hypothesis testing
    Helps determine whether an observed difference could be due simply to random chance
    Provides a P value; if P is small, the difference is unlikely to be due to random chance alone, and the result is called statistically significant
  Statistical modeling
    Tests how well experimental data fit a mathematical model
    The most common form of statistical modeling is linear regression (LR)
    LR usually determines the best straight line through a set of data points

What statistical calculations cannot do: Analysis of a simple experiment
  Define a population you are interested in
  Randomly select a sample of subjects to study
  Randomly split the sample subjects into two groups
    One group gets one treatment
    The other group gets another treatment
  Measure a single variable trait in each subject
  Use statistical tests to determine if there is a difference between the groups

What statistical calculations cannot do: The problems with real experiments
  Populations can be more diverse than your samples
  Samples are collected by convenience rather than randomly
  The measured value is a proxy for what you are really interested in
  Errors in data collection
    Data are recorded incorrectly
    Assays may not report what you think they report
  You need to combine different types of measurements to reach an overall conclusion (multiple variables)

Why statistics are difficult to learn
  Deceptive terminology (significant, error, hypothesis)
  Statistical conclusions are never absolute (statistically significant)
  Statistics uses abstract concepts (populations, probabilities)
  Statistics are at the interface of math and science
  Many statistical calculations require complex math

Variables
  Independent variable - The variable scientists manipulate to evaluate a response
  Dependent variable - The variable (i.e., trait) measured as the response to manipulating the independent variable

Variables: Types of variables in biology
  Measurement variables
    Continuous
    Discontinuous
  Ranked variables
  Attributes

Variables: Measurement variables
  Those whose differing states can be expressed in a numerically ordered fashion
  Continuous
    Can assume any value between two distinct points
    For example, there are infinitely many values between 1.5 and 1.6
    Include: lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, rates
  Discontinuous
    Discrete variables that can take only fixed numerical values
    The number of segments in an insect's appendage may be 4, 5, or 6, but not 4.3

Variables: Ranked and attribute variables
  Ranked variables
    Variables that cannot be measured but can be placed in order
    For example, order of emergence of pupae without regard to time
  Attribute variables
    Variables that cannot be measured, but must be expressed qualitatively
    For example: black/white; pregnant/nonpregnant; male/female; live/dead

Appropriate tests
  1 variable, 1 sample
    Measurement variables: computing means; computing standard deviations
    Ranked variables: computing median and frequencies
    Attribute variables: confidence limits for percentages; runs test for randomness
  1 variable, 2 samples
    Measurement variables: t-tests; test of equality; paired comparisons test
    Ranked variables: Mann-Whitney U-test; Kolmogorov-Smirnov two-sample test
    Attribute variables: testing differences between two percentages
  1 variable, 2+ samples
    Measurement variables: ANOVA; Tukey-Kramer test
    Ranked variables: Kruskal-Wallis test; Friedman's randomized block test
    Attribute variables: G-test for percentages
  2 variables
    Measurement variables: regression analysis; polynomial regression
    Ranked variables: Olmstead and Tukey's ordering test; Spearman's rank correlation
    Attribute variables: chi-square test; Fisher's exact test

Means and Standard Deviations
  The mean is the average of a measured trait across the individuals sampled from a population
  In biology, we usually compare two or more populations, which we call groups
  The standard deviation measures the scatter of values around the mean (it is the square root of the variance)
  Many statistical tests use means and standard deviations to determine if there are significant differences between groups

Null hypothesis
  The default assumption, taken to be true until the data give reason to reject it (for example, that there is no difference between groups)
  Statistics can be used to reject the null hypothesis, never to prove it
  Rejecting it lends support to an alternative hypothesis
  Nearly every experiment that uses statistics should define null and alternative hypotheses

Student's t-test
  Determines if there is a significant difference between the means of two groups of measured data
  Paired - compares matched values between members of the groups
  Unpaired - assumes values between members are not related
  Assumes values fit a normal (aka Gaussian) distribution ("bell curve")
    If they do not, use nonparametric testing
  One-tailed vs. two-tailed
    One-tailed: You must specify which group will have the larger mean before collecting data
    Two-tailed: You do not know which group will have the larger mean before collecting data

Student's t-test: P value and t value
  P value: Is there a significant difference between the means of the two groups?
    Generally, if the P value is less than or equal to 0.05, the difference is considered significant
  t value: Positive if the first mean is larger than the second, negative if it is smaller

Student's t-test: Confidence interval
  The calculated mean is unlikely to be exactly the same as the mean of the entire population
  Assumes your samples are randomly collected and fit a normal distribution
  If your sample is large and has a small standard deviation, your calculated mean is likely close to the actual mean
  The confidence interval (CI) is calculated from the sample size and the standard deviation
  If the CI is 95%, then the range around your calculated mean probably (95%) includes the actual mean of the population under study
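
The t-test and confidence-interval points above can be made concrete with a small sketch. This is an illustration only, not the method used in the slides: it assumes an unpaired, two-tailed test with pooled variance (equal variances in both groups), uses only the Python standard library, and obtains the P value by numerically integrating the t distribution. The two data sets and all function names (`t_pdf`, `two_tailed_p`, `unpaired_t_test`, `t_critical`, `mean_confidence_interval`) are hypothetical and invented for the example.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P value: 1 minus the area between -|t| and +|t|.

    The area is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_0_to_t)


def unpaired_t_test(a, b):
    """Unpaired t-test with pooled variance (assumes equal variances)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # t is positive if the first mean is larger, negative if it is smaller
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, df, two_tailed_p(t, df)


def t_critical(df, confidence=0.95):
    """Find t* with two_tailed_p(t*, df) == 1 - confidence, by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > 1 - confidence:
            lo = mid  # mid is too small: too much probability left in the tails
        else:
            hi = mid
    return (lo + hi) / 2


def mean_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean, from sample size and standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    margin = t_critical(n - 1, confidence) * sem
    return mean - margin, mean + margin


# Hypothetical measurements for two groups (made-up numbers)
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4, 5.0, 5.1]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6, 6.1, 5.9]

t, df, p = unpaired_t_test(control, treated)
print("t =", round(t, 2), "df =", df, "P =", p)
print("significant at 0.05:", p <= 0.05)
print("95% CI for control mean:", mean_confidence_interval(control))
```

Note that the interval is built from the standard error (standard deviation divided by the square root of the sample size), so it narrows as the sample grows or the scatter shrinks. For real analyses, a vetted statistics package would be a safer choice than this hand-rolled integration.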