Basic Statistical Concepts
Donald E. Mercante, Ph.D.
Biostatistics, School of Public Health, LSU-HSC

Two Broad Areas of Statistics
Descriptive Statistics
- Numerical descriptors
- Graphical devices
- Tabular displays
Inferential Statistics
- Hypothesis testing
- Confidence intervals
- Model building/selection

Descriptive Statistics
When computed for a population of values, numerical descriptors are called parameters.
When computed for a sample of values, numerical descriptors are called statistics.

Descriptive Statistics
Two important aspects of any population:
- Magnitude of the responses
- Spread among population members

Descriptive Statistics
Measures of Central Tendency (magnitude)
Mean
- most widely used
- uses all the data
- best statistical properties
- susceptible to outliers
Median
- does not use all the data
- resistant to outliers

Descriptive Statistics
Measures of Spread (variability)
Range
- simple to compute
- does not use all the data
Variance
- uses all the data
- best statistical properties
- measures the average squared distance of values from the mean

Properties of Statistics
• Unbiasedness - on target
• Minimum variance - most reliable
• If an estimator possesses both properties, then it is a MINVUE = MINimum Variance Unbiased Estimator
• For normally distributed data, the sample mean and variance are UMVUE = Uniformly Minimum Variance Unbiased Estimators

Inferential Statistics
- Hypothesis Testing
- Interval Estimation

Hypothesis Testing
Specifying hypotheses:
H0: “null” or no-effect hypothesis
H1: research or alternative hypothesis
Note: Only H0 (null) is tested.

Errors in Hypothesis Testing
                       Reality
Decision               H0 True             H0 False
Fail to Reject H0      Correct decision    Type II error
Reject H0              Type I error        Correct decision

Hypothesis Testing
In parametric tests, actual parameter values are specified for H0 and H1.
H0: µ ≤ 120
H1: µ > 120

Hypothesis Testing
Another example of explicitly specifying H0 and H1.
H0: θ = 0
H1: θ ≠ 0
(θ denotes the parameter under test.)

Hypothesis Testing
General framework:
• Specify null & alternative hypotheses
• Specify test statistic
• State rejection rule (RR)
• Compute test statistic and compare to RR
• State conclusion

Common Statistical Tests
Test Name                 Purpose
One-sample (z) t-test     Test value of a mean
Two-sample (z) t-test     Compare two means
Paired t-test             Compare difference in means (compare related means)
ANOVA                     Test for differences in 2 or more means

Common Statistical Tests (cont.)
Test                                  Purpose
Test on binomial proportion(s)        Test whether a binomial proportion equals a specified value, or whether proportions equal each other
Test on correlation coefficient(s)    Test whether a correlation coefficient = 0, or whether coefficients equal each other
Regression                            Test whether slope = 0
RxC contingency table analysis        Test whether two categorical variables are related

Advanced Topics
Test                              Purpose
Multivariate Tests, e.g., MANOVA  Test value of several parameters simultaneously
Repeated Measures / Crossovers    Test means when subjects are repeatedly measured
Survival Analysis                 Estimate and compare survival probabilities for one or more groups
Nonparametric Tests               Many analogous to standard parametric tests

P-Values
p = probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true.
- P-values are probabilities, so 0 ≤ p ≤ 1
- Computed from the distribution of the test statistic

Epidemiological Concepts
Rate: a proportion, specifically a fraction in which the numerator, c, is included in the denominator:
$$\frac{c}{c + d}$$
- Useful for comparing groups of unequal size
Example: neonatal mortality rate
$$\frac{\text{\# deaths at under 28 days old}}{\text{total \# live births}}$$

Epidemiological Concepts
Measures of Morbidity:
Incidence Rate: # new cases occurring during a given time interval divided by the population at risk at the beginning of that period.
Prevalence Rate: total # cases at a given time divided by the population at risk at that time.
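The two morbidity measures above are simple ratios, and a minimal sketch can make the difference in numerator and denominator explicit. The function names and all input counts below are hypothetical, chosen only for illustration; they do not come from the slides.

```python
def incidence_rate(new_cases: int, at_risk_at_start: int) -> float:
    """New cases during the interval / population at risk at the start of it."""
    if at_risk_at_start <= 0:
        raise ValueError("population at risk must be positive")
    return new_cases / at_risk_at_start


def prevalence_rate(existing_cases: int, at_risk_now: int) -> float:
    """All cases present at a given time / population at risk at that time."""
    if at_risk_now <= 0:
        raise ValueError("population at risk must be positive")
    return existing_cases / at_risk_now


# Hypothetical counts, for illustration only.
incidence = incidence_rate(new_cases=30, at_risk_at_start=2000)
prevalence = prevalence_rate(existing_cases=150, at_risk_now=2500)

print(f"incidence rate  = {incidence:.3f}")
print(f"prevalence rate = {prevalence:.3f}")
```

Incidence counts only new cases over an interval, while prevalence counts every case present at one point in time, so the two can differ widely for the same disease.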
Epidemiological Concepts
Most people think in terms of the probability (p) of an event as a natural way to quantify the chance that an event will occur: 0 ≤ p ≤ 1
0 = event will certainly not occur
1 = event certain to occur
But there are other ways of quantifying the chances that an event will occur.

Epidemiological Concepts
Odds and Odds Ratio:
$$\text{Odds of an event} = O = \frac{\text{expected \# times the event will occur}}{\text{expected \# times the event will not occur}}$$
For example, O = 4 means we expect 4 times as many occurrences as non-occurrences of an event.
In gambling, we say the odds are 5 to 2. This corresponds to the single number 5/2 = Odds.

Epidemiological Concepts
The relationship between probability & odds:
$$O = \frac{p}{1 - p} = \frac{\text{prob of event}}{\text{prob of no event}}$$
$$p = \frac{O}{1 + O}$$

Epidemiological Concepts
Probability    Odds
.1             .11
.2             .25
.3             .43
.4             .67
.5             1.00
.6             1.50
.7             2.33
.8             4.00
.9             9.00
Odds < 1 correspond to probabilities < 0.5
0 < Odds < ∞

Example 1: Odds Ratio
Death sentence by race of defendant in 147 trials
          Blacks    Nonblacks    Total
Death       28         22          50
Life        45         52          97
Total       73         74         147

Example 2: Odds Ratio
Odds of death sentence = 50/97 = 0.52
For Blacks: O = 28/45 = 0.62
For Nonblacks: O = 22/52 = 0.42
Ratio of Black Odds to Nonblack Odds = 1.47
This is called the Odds Ratio:
$$OR = \frac{28/45}{22/52} = \frac{28 \times 52}{22 \times 45} = \frac{1456}{990} = 1.47$$

Logistic Regression
Odds ratios are directly related to the parameters of the logit (logistic regression) model.
Logistic Regression is a statistical method that models binary (e.g., Yes/No; T/F; Success/Failure) data as a function of one or more explanatory variables.
We would like a model that predicts the probability of a success, i.e., P(Y=1), using a linear function.

Logistic Regression
Problem: Probabilities are bounded by 0 and 1, but linear functions are inherently unbounded.
Solution: Transform P(Y=1) = p to an odds. If we take the log of the odds, the lower bound is also removed.
Setting this result equal to a linear function of the explanatory variables gives us the logit model.
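The chain from probability to odds to log odds can be checked numerically. The sketch below is illustrative only: the helper names (`odds`, `prob_from_odds`, `logit`) are assumptions introduced here, not part of the slides. It reproduces the probability/odds table and the odds ratio from the death-sentence example, and shows that odds lie in (0, ∞) while log odds can take any real value.

```python
import math


def odds(p: float) -> float:
    """Convert a probability in [0, 1) to odds: O = p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def prob_from_odds(o: float) -> float:
    """Convert odds back to a probability: p = O / (1 + O)."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return o / (1 + o)


def logit(p: float) -> float:
    """Log odds; defined for 0 < p < 1 and unbounded in both directions."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    return math.log(odds(p))


# Reproduce the probability/odds table from the slides.
for p in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"p = {p:.1f}  odds = {odds(p):5.2f}  log odds = {logit(p):6.2f}")

# Odds ratio from the death-sentence table (counts taken from the slides).
odds_black = 28 / 45
odds_nonblack = 22 / 52
odds_ratio = odds_black / odds_nonblack
print(f"odds ratio = {odds_ratio:.2f}")
```

Log odds are negative for p < 0.5, zero at p = 0.5, and positive for p > 0.5, which is why they can be set equal to an unbounded linear function.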
Logistic Regression
Logit or Logistic Regression Model:
$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$$
where $p_i$ is the probability that $y_i = 1$ and $\alpha$ is the intercept. The expression on the left is called the logit or log odds.

Logistic Regression
Probability of success:
$$p_i = P(Y = 1) = \frac{1}{1 + e^{-(\alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik})}}$$
Odds Ratio for Each Explanatory Variable:
$$\text{OR for } X_j = e^{\beta_j}, \quad j = 1, \ldots, k$$

Screening Tests
Suppose a new screening test for herpes virus has been developed and the following summary for 1000 individuals has been compiled:
                     Has Herpes    Does Not Have Herpes
Screened Positive        45                 10
Screened Negative         5                940

Screening Tests
How do we evaluate the usefulness of such a test?
Diagnostics:
- sensitivity
- specificity
- false positive rate
- false negative rate
- predictive value positive
- predictive value negative

Screening Tests
Generic Screening Test Table
                     With Disease    Without Disease    Total
Screened Positive         a                 b            a+b
Screened Negative         c                 d            c+d
Total                    a+c               b+d            N

Screening Tests
$$\text{prevalence} = \frac{a + c}{N}$$
$$\text{Sensitivity} = \frac{a}{a + c}$$
$$\text{Specificity} = \frac{d}{b + d}$$
$$\text{False positive rate} = \frac{b}{b + d}$$
$$\text{False negative rate} = \frac{c}{a + c}$$
$$\text{Yield or predictive value positive} = \frac{a}{a + b}$$
$$\text{Yield or predictive value negative} = \frac{d}{c + d}$$

Screening Tests
prevalence = 50/1000 = 5%
Sensitivity = 45/50 = 90%
Specificity = 940/950 = 98.95%
False positive rate = 10/950 = 1.05%
False negative rate = 5/50 = 10%
Yield or predictive value positive = 45/55 = 81.82%
Yield or predictive value negative = 940/945 = 99.47%

Interval Estimation
Statistics such as the sample mean, median, variance, etc., are called point estimates. They
- vary from sample to sample
- do not incorporate precision

Interval Estimation
Take as an example the sample mean:
$\bar{X}$ estimates $\mu$ (popn mean)
Or the sample variance:
$S^2$ estimates $\sigma^2$ (popn variance)

Interval Estimation
Recall Example 1, a one-sample t-test on the population mean. The test statistic was
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
With the true mean $\mu$ in place of $\mu_0$, this can be rewritten to yield:

Interval Estimation
$$P\left(-t_{1-\alpha/2,\,n-1} \le \frac{\bar{x} - \mu}{s / \sqrt{n}} \le t_{1-\alpha/2,\,n-1}\right) = 1 - \alpha$$
which can be rearranged to give a $(1-\alpha)100\%$ Confidence Interval for $\mu$:
$$\bar{x} \pm t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$$
Form: Estimate ± Multiple of Std Error of the Est.

Interval Estimation
Example 1: Standing SBP
Mean = 140.8, s.d. = 9.5, N = 12
95% CI for µ:
140.8 ± 2.201 (9.5/sqrt(12))
= 140.8 ± 6.036
= (134.8, 146.8)
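The Standing SBP interval can be reproduced in a few lines. This is a minimal sketch: the function name `t_confidence_interval` is hypothetical, and the critical value 2.201 (the 0.975 quantile of the t distribution with 11 degrees of freedom) is taken directly from the slide rather than computed.

```python
import math


def t_confidence_interval(mean: float, sd: float, n: int, t_crit: float):
    """Return (lower, upper, margin) for mean +/- t_crit * sd / sqrt(n)."""
    if n < 2:
        raise ValueError("need at least two observations")
    std_error = sd / math.sqrt(n)   # standard error of the sample mean
    margin = t_crit * std_error     # multiple of the standard error
    return mean - margin, mean + margin, margin


# Values from the Standing SBP example; t_crit is the tabled value
# for a 95% interval with n - 1 = 11 degrees of freedom.
lower, upper, margin = t_confidence_interval(mean=140.8, sd=9.5, n=12, t_crit=2.201)

print(f"margin = {margin:.3f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")
```

Passing the critical value in as an argument keeps the sketch free of external dependencies; a statistics library could supply it for other confidence levels or sample sizes.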