CTSI BERD Research Methods Seminar Series
Statistical Analysis I
Mosuk Chow, PhD
Senior Scientist and Professor, Department of Statistics
December 8, 2015

Biostatistics, Epidemiology, Research Design (BERD)
BERD Goals:
• Match the needs of investigators to the appropriate biostatisticians/epidemiologists/methodologists
• Provide BERD support to investigators
• Offer BERD education to students and investigators via in-person, videoconferenced, and on-line classes
http://ctsi.psu.edu/ctsi-programs/biostatisticsepidemiologyresearch-design/

Statistics Encompasses
• Study design
  - Selection of efficient design (cohort study/case-control study)
  - Sample size
  - Randomization
• Data collection
• Summarizing data
  - Important first step in understanding the data collected
• Analyzing data to draw conclusions
• Communicating the results of analyses

Keys to Successful Collaboration Between Statistician and Investigator: A Two-Way Street
• Involve statistician at beginning of project (planning/design phase)
• Specific objectives
• Communication
  - Avoid jargon
  - Willingness to explain details

Keys to Successful Collaboration: A Two-Way Street
• Respect
  - Knowledge
  - Skills
  - Experience
  - Time
• Embrace statistician as a member of the research team
• Fund statistician on grant application for best collaboration
  - Most statisticians are supported by grants, not by institutional funds

Statistical Analysis
• Describing data
  - Numeric or graphic
• Statistical inference
  - Estimation of parameters of interest
  - Hypothesis testing
  - Regression modeling
• Interpretation and presentation of the results

Describing data: Basic Terms
• Measurement: assignment of a number to a characteristic of an object or event
• Data: collection of measurements
• Sample: collected data
• Population: all possible data
• Variable: a property or characteristic of the population/sample, e.g., gender, weight, blood pressure
Example of data set/sample
Data on albumin and bilirubin levels before and after treatment with a study drug

ID   DRUG   BILI   ALBUMIN   BASE_BIL   BASE_ALB
 6   0      0.7    4.2       0.8        3.98
 7   0      1.2    3.59      1          4.09
 8   0      1.3    3.08      0.3        4
11   0      2.1    3.58      1.4        4.16
13   0      1.1    3.39      0.7        3.85
16   0      0.6    3.8       0.7        3.66
21   0      1.7    3.22      0.6        3.83
 2   1      3.6    2.92      1.1        4.14
15   1      1.2    3.72      0.8        3.87
19   1      0.4    3.92      0.7        3.56
24   1      3.6    3.66      2.1        4
34   1      0.8    3.85      0.8        3.7
43   1      0.7    3.78      1.1        3.64

Describing Data
• Types of data
• Summary measures (numeric)
• Visually describing data (graphical)

Types of Variables
• Qualitative or Categorical
  - Binary (or dichotomous): True/False, Yes/No
  - Nominal: no natural ordering (ethnicity)
  - Ordinal: categories have natural ranks
    Degree of agreement (strong, modest, weak)
    Size of tumor (small, medium, large)
• Quantitative
  - Ratio: ordered, constant scale, natural zero (age, weight)
  - Interval: ordered, constant scale, no natural zero
    Differences make sense, but ratios do not
    Temperature in Celsius (30° - 20° = 20° - 10°, but 20° is not twice as hot as 10°)

Types of Measurements for Quantitative Variables
• Continuous: weight, height, age
• Discrete: a countable number of values
  - The number of births, age in years
• Likert scale: “agree”, “strongly agree”, etc.
  - Somewhere between ordinal and discrete
  - Scales with <= 4 possibilities are usually considered to be ordinal.
  - Scales with >= 7 possibilities are usually considered to be discrete.

Descriptive Statistics
Quantitative variable
• Measure(s) of central location/tendency
  - Mean
  - Median
  - Mode
• Measure(s) of variability (dispersion)
  - Describe the spread of the distribution

Descriptive Statistics (cont.)
Summary measures of dispersion/variation
• Minimum and Maximum
• Range = Maximum - Minimum
• Sample variance (abbreviated $s^2$) and standard deviation ($s$ or SD), with denominator $n - 1$

Other Measures of Variation
• Interquartile range (IQR): 75th percentile - 25th percentile
• MAD: median absolute deviation
• CV: coefficient of variation
  $\mathrm{CV} = \dfrac{s}{\bar{X}} \times 100\%$
  - Ratio of SD over sample mean
  - Measures relative variability
  - Independent of measurement units
  - Useful for comparing two or more sets of data

Describing data graphically
Tell whole story of data, detect outliers
• Histogram
• Stem and Leaf Plot
• Box Plot

Histogram
• 113 men
• Each bar spans a width of 5 mmHg.
• The height represents the number of individuals in that range of SBP.
[Figure: histogram of systolic BP (mmHg) against number of men]
Divide range of data into intervals (bins) of equal width. Count the number of observations in each class.

Histogram of SBP
[Figure: two histograms of systolic BP (mmHg) against number of men, one with bin width = 20 mmHg and one with bin width = 1 mmHg]

Stem and Leaf Plot
• Provides a good summary of data structure
• Easy to construct and much less prone to error than the tally method of finding a histogram

  2 | 889
  3 | 01112334455556667777899
  4 | 001111122333444455567789
  5 | 011234

• “stem”: the first digit or digits of the number.
• “leaf”: the trailing digit.

Box Plot: SBP for 113 Males
[Figure: boxplot of systolic blood pressures for a sample of 113 men; vertical axis is blood pressure (80 to 160); marked from top to bottom: largest observation, 75th percentile, sample median, 25th percentile, smallest observation]

Descriptive Statistics (cont.)
Categorical variable
• Frequency (counts) distribution
• Relative frequency (percentages)
• Pie chart
• Bar graph

Describe relationship between two variables
• One quantitative and one categorical
  - Descriptive statistics within each category
  - Side by side boxplots/histograms
• Both quantitative
  - Scatter plot
• Both categorical
  - Contingency table

Statistical Inference
A process of making inference (an estimate, prediction, or decision) about a population (parameters) based on a sample (statistics) drawn from that population.
[Figure: a population distribution (parameters: fixed, unknown) and a sample histogram (statistics: vary from sample to sample), both plotted over systolic BP (mmHg), with inference running from sample to population]

Statistical Inference
Questions to ask in selecting appropriate methods
• Are observation units independent?
• How many variables are of interest?
• Type and distribution of variable(s)?
• One-sample or two-sample problem?
• Are samples independent?
• Parameters of interest (mean, variance, proportion)?
• Sample size sufficient for the chosen method?
(see decision making flow chart in the handout)

Estimation of population mean
• We don’t know the population mean $\mu$ but would like to estimate it.
• We draw a sample from the population.
• We calculate the sample mean $\bar{X}$.
• How close is $\bar{X}$ to $\mu$?
• Statistical theory will tell us how close $\bar{X}$ is to $\mu$.
• Statistical inference is the process of trying to draw conclusions about the population from the sample.

Key Statistical Concept
Question: How close is the sample mean to the population mean?

Statistical Inference for sample mean
• Sample mean will change from sample to sample
• We need a statistical model to quantify the distribution of sample means (sampling distribution)
• Sometimes, need “normal distribution” for the population data

Normal Distribution
Normal distribution, denoted by $N(\mu, \sigma^2)$, is characterized by two parameters
• $\mu$: The mean is the center.
• $\sigma$: The standard deviation measures the spread (variability).
[Figure: probability density function of a normal distribution, with the mean and the standard deviation marked]

Distribution of Blood Pressure in Men (population)
Y: Blood pressure, $Y \sim N(\mu, \sigma^2)$
Parameters: Mean, $\mu = 125$ mmHg; SD, $\sigma = 14$ mmHg
[Figure: normal curve over 83, 97, 111, 125, 139, 153, 167 mmHg, with bands marked 68%, 95%, and 99.7%]
The 68-95-99.7 rule for normal distribution applied to the distribution of systolic blood pressure in men.

Sampling Distribution
• The sampling distribution refers to the distribution of the sample statistics (e.g. sample means) over all possible samples of size n that could have been selected from the study population.
• If the population data follow a normal distribution $N(\mu, \sigma^2)$, then the sample means follow a normal distribution $N(\mu, \sigma^2/n)$.
• What if the population data do not come from a normal distribution?

Central Limit Theorem (CLT)
• If the sample size is large, the distribution of sample means approximates a normal distribution: $\bar{X} \sim N(\mu, \sigma^2/n)$, approximately.
• The Central Limit Theorem works even when the population is not normally distributed (or even not continuous).
  http://onlinestatbook.com/stat_sim/sampling_dist/index.html
• For sample means, the standard rule is n > 60 for the Central Limit Theorem to kick in, depending on how “abnormal” the population distribution is. 60 is a worst-case scenario.

Sampling Distribution
• By CLT, about 95% of the time, the sample mean will be within two standard errors of the population mean.
  - This tells us how “close” the sample statistic should be to the population parameter.
• Standard errors (SE) measure the precision of your sample statistic.
• A small SE means it is more precise.
• The SE is the standard deviation of the sampling distribution of the statistic.

Standard Error of Sample Mean
• The standard error of sample mean (SEM) is a measure of the precision of the sample mean.
  $\mathrm{SEM} = \dfrac{\sigma}{\sqrt{n}}$
  $\sigma$: standard deviation (SD) of the population distribution.
• The standard deviation is not the standard error of a statistic!
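The sampling-distribution claims above can be checked with a small simulation. The sketch below is only an illustration, not part of the seminar: it assumes a skewed exponential population with mean 1 and SD 1, and the sample size, number of repetitions, and seed are arbitrary choices. It draws many samples, computes each sample mean, and compares the spread of those means with $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, rng):
    """Return `reps` sample means, each from a sample of size n.

    The population is exponential with rate 1 (mean 1, SD 1), an
    assumption chosen because it is clearly not normal.
    """
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(2015)   # fixed seed so the run is reproducible
n, reps = 100, 5000         # illustrative sample size and repetitions
mu, sigma = 1.0, 1.0        # mean and SD of the assumed population

means = simulate_sample_means(n, reps, rng)

sem_theory = sigma / math.sqrt(n)        # SEM = sigma / sqrt(n)
sem_observed = statistics.stdev(means)   # SD of the simulated sample means

# Fraction of sample means within two standard errors of mu
within_2se = sum(abs(m - mu) <= 2 * sem_theory for m in means) / reps

print(f"theoretical SEM: {sem_theory:.4f}")
print(f"observed SD of sample means: {sem_observed:.4f}")
print(f"fraction within 2 SE of mu: {within_2se:.3f}")
```

The observed SD of the sample means should land close to the theoretical SEM of 0.1, and roughly 95% of the sample means should fall within two standard errors of the population mean, even though the individual observations are far from normal.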
Example
Measure systolic blood pressure on a random sample of 100 students
• Sample size n = 100
• Sample mean $\bar{x} = 125$ mm Hg
• Sample SD s = 14.0 mm Hg
• $\mathrm{SEM} = \dfrac{14}{\sqrt{100}} = 1.4$ mmHg
• Population SD ($\sigma$) can be replaced by the sample SD for a large sample

Confidence Interval for population mean
• An approximate 95% confidence interval for the population mean $\mu$ is $\bar{X} \pm 2 \times \mathrm{SEM}$, or more precisely $\bar{X} \pm 1.96 \times \mathrm{SEM}$.
• $\bar{X}$ is a random variable (it varies from sample to sample), so the confidence interval is random, and it has a 95% chance of covering $\mu$ before a sample is selected.
• Once a sample is taken, we observe $\bar{X} = \bar{x}$; then either $\mu$ is within the calculated interval or it is not.
• The confidence interval gives the range of plausible values for $\mu$.
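As a worked check of the example above, the short sketch below plugs the slide's numbers (n = 100, sample mean 125 mmHg, sample SD 14 mmHg) into the SEM and the 1.96 × SEM interval. The function name `mean_confidence_interval` is a hypothetical helper introduced here for illustration, and using the sample SD in place of $\sigma$ relies on the large-sample assumption stated in the example.

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Approximate confidence interval for a population mean.

    Uses the sample SD in place of the population SD, which the
    example treats as acceptable for a large sample. The default
    z = 1.96 corresponds to a 95% interval.
    """
    sem = sample_sd / math.sqrt(n)   # standard error of the sample mean
    margin = z * sem
    return sample_mean - margin, sample_mean + margin, sem

# Numbers from the blood pressure example
lower, upper, sem = mean_confidence_interval(125.0, 14.0, 100)

print(f"SEM = {sem:.2f} mmHg")                        # SEM = 1.40 mmHg
print(f"95% CI = ({lower:.2f}, {upper:.2f}) mmHg")    # 95% CI = (122.26, 127.74) mmHg
```

With the cruder multiplier of 2 instead of 1.96, the same call with `z=2` gives 125 ± 2.8, that is, 122.2 to 127.8 mmHg.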