Statistics - Stony Brook University School of Medicine

Research Tools
Robert Woroniecki, MD, MS

Why do we need research tools?
• To answer research questions
• Choice of tools will depend on the question asked
• Choice of question will depend on
  – Motivation
  – Available tools
  – People around you

Telemachus and Mentor
Antonie van Leeuwenhoek & microscope
Darwin & Beagle
Wilhelm Röntgen & X-ray
Jonas Salk & Henrietta Lacks
Ronald Fisher & statistical tests
Mario Capecchi & knockout mice

A tool is any item that can be used to achieve a goal, especially if the item is not consumed in the process.

IRB
• A federally mandated committee charged with the responsibility:
  – To review proposed research
  – To ensure that the rights of research participants are protected
  – To ensure that risk of harm to participants is minimized
  – To ensure selection of subjects is equitable
  – To ensure additional safeguards for any vulnerable subjects

Formulating Question and Choosing the Tool
• Formulating a study question is the first step in designing a research project and choosing the tools that might work.
• Knowing you want to do research in an area is not enough, e.g.
  - "I want to do research on kidney or kidney transplantation" is too general, and not a question
Examples of possible research questions
  - What proportion of ESRD patients receive a kidney transplant?
  - Does blood type determine length of waiting time on the transplant list?

How to select research question
• Interest/motivation
• What will be the impact of your findings?
• Novelty: if already done, is there room for improvement?
Feasibility: do you have tools?
• Time – residency lasts only 3 years, and time goes by fast!
• Sufficient study population – can you enroll enough subjects to report a meaningful result?

The 4-question schema
Questions: Causation (mechanism), Ontogeny (development), Adaptation (function), Phylogeny (evolution)
Levels: Molecule, Cell, Organ, Individual, Family, Group, Society
Alcock, 2001

Study Question Quality
• Descriptive questions describe, e.g.
  – Prevalence of CKD in a specified population
  – Graft survival after kidney transplantation
  – Proportion of patients with Alport's syndrome who go on to develop ESRD
  – Proportion of patients who develop CMV viremia in the first year after transplant
• Analytic questions compare, e.g.
  – Is graft survival better with LRD than with DDRT?
  – Do ACE/ARB reduce the risk of progression to ESRD in patients with Alport's?
  – Are patients who receive prophylaxis less likely to develop CMV viremia?

Has your question been answered?
Look up the literature and keep references organized!
• PubMed: http://www.ncbi.nlm.nih.gov/pubmed
• EndNote: http://it.stonybrook.edu/software/title/endnote

Hypothesis
The primary research question should be driven by the hypothesis rather than the data. A hypothesis is a statement of expectation or prediction that will be tested by research.
• Null Hypothesis (H0)
  - no difference
  - no association
• Alternative Hypothesis (HA)
  - difference exists
  - association exists
E.g.
H0: There is no difference in graft survival between deceased and living donor kidney transplants
H0: There is no association between dialysate sodium concentration and post-dialysis serum sodium concentration in adult patients receiving hemodialysis

Study Question
• Analytic questions are more interesting than descriptive ones
• Answering analytic questions enables development of interventions
• For both types, you need to specify the study population, e.g. men, adults, elderly, kidney transplant recipients, ESRD patients, etc.

Data and tools to manage it

Variables
• Continuous (interval): age, BP, temp, cholesterol, mRNA, immunofluorescence optical density
• Categorical
  – Dichotomous, e.g. yes or no, male or female, alive or dead
  – Ordinal (ordered scale), e.g. NYHA classification for CHF, Likert scale
  – Nominal (can't be ordered), e.g. ethnicity

DEFINE OUTCOME
To conduct a study you need to decide
  - What is the outcome of interest? e.g. graft survival
  - What variable best describes/measures your outcome of interest? e.g.
    serum creatinine (interval variable)
    return to dialysis (categorical variable, yes/no)
  - What independent variables should you also measure? e.g. age, gender, ethnicity, DM, HTN
• Outcome variable = dependent variable
• Independent variable = explanatory variable, grouping variable

Data collection
• Source
  - National databases, e.g. UNOS, USRDS, DHS
  - DCI
  - Institutional database (EMR)
  - Questionnaire
  - Patient chart
  - Interview
• Entry/Storage
  - EpiData or Epi Info: http://www.epidata.dk http://wwwn.cdc.gov/epiinfo/7/
  - Excel
  - Access
• Data analysis software – SPSS, STATA, SAS

Software
http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize

Univariate Statistics
Analysis of a single variable, continuous or categorical. The variable may be an independent or an outcome variable.
Continuous, e.g. age of the study population, eGFR, blood pressure
  - mean
  - median
  - mode
  - variance & standard deviation
Categorical/discrete variables, e.g. gender, ethnicity, death, ESRD, Likert scale
  - Proportion/percentage
  - Frequency tables
  - Events that occur over time: survival curves and incidence rate

Graphic Display of Univariate statistics
[figures]

You have to know what you are doing
What is the problem with this picture?
[figure]

Bivariate statistics
Measure of association between 2 variables.
Association of discrete outcomes (categorical variables, e.g. yes/no)
• Chi-square test
• Fisher's exact test
Continuous outcomes:
• t-test
• ANOVA
• Correlation

Measuring association between 2 dichotomous variables

              Condition (disease)   No condition (no disease)
Exposed               a                       b
Unexposed             c                       d

Incidence
  Incidence in exposed = a/(a+b)
  Incidence in unexposed = c/(c+d)
Odds
  Odds of disease in exposed = a/b
  Odds of disease in unexposed = c/d

• Relative risk: the ratio of the incidence (risk of developing disease) in the exposed to that in the unexposed.
  RR = [a/(a+b)] / [c/(c+d)]
• Odds ratio: the odds of developing disease in the exposed divided by the odds in the unexposed.
  OR = (a/b)/(c/d) = (a*d)/(b*c)
• For rare diseases the odds ratio approximates the relative risk!

Chi-square test
• Compares observed outcome to expected outcome.
• Independent and outcome variable are both categorical

              CMV    No CMV
Valcyte        a       b
No Valcyte     c       d

OR = ad/bc
If any expected cell count is <5, use Fisher's exact test.

t-test
• Association of a dichotomous variable and a continuous variable
• Independent variable is dichotomous and outcome variable is continuous.
• Used to compare the mean between two groups, e.g. eGFR between men and women
• t = (mean eGFR in men – mean eGFR in women) / (standard error of the difference between the two means)
Assumptions
  – eGFR is normally distributed in both men and women
  – The variances are equal
If the assumptions are not met, use a non-parametric test: the Mann-Whitney test, which is based on ranks instead of means (the mean is not an accurate reflection of a non-normally distributed population).

ANOVA
• Association of a nominal variable with a continuous variable
• Compares means in more than 2 groups, e.g. difference in mean blood pressure readings between blacks, whites and Hispanics.
Assumptions – normal distribution, equal variance
If the assumptions are not met – Kruskal-Wallis test

Correlation ≠ causation
• Measure of the strength and direction of a relationship between two continuous variables
• Typically represented by Pearson's r
• E.g. age and eGFR
Pearson's correlation coefficient (r) ranges from -1 to +1; 0 means there is no linear relationship.
• Possible explanations for a correlation other than direct causation: reverse causation, common causes
For continuous variables that are not normally distributed, use Spearman's rank correlation.

Confounding
• The apparent association between a risk factor and an outcome is affected by the relationship of a third variable to both the risk factor and the outcome.
[Diagram: Risk Factor, Outcome, Confounder]

Confounding Example
How do you prevent confounding?
[Diagram: OBESITY, CKD, DM II]
Example:
If you want to investigate whether obesity is a risk factor for CKD, type II DM (which is also associated with obesity) may be a confounder. If you observe that obesity is a risk factor for CKD, you must make sure the effect you are seeing is not the effect of DM II on CKD.

Question
How can you minimize confounding?
• If the confounder is unknown – randomization
• If the confounder is known – adjust for it

Multivariate analysis
• A statistical tool for determining the unique (independent) contribution of various factors to a single outcome.
• Essential because most clinical events have more than one cause and a number of confounders.
We live in a multivariable world: most outcomes have multiple causes.
E.g. a bivariate analysis may tell us that smoking, obesity, sedentary lifestyle, hypertension and diabetes are associated with an increased risk for coronary artery disease, but are these factors independent of one another? That is, does a risk factor remain significant after adjusting for the other risk factors?
Useful to reduce confounding during the analytic phase of a study!

Multivariable analysis
• Multiple linear regression
• Multiple logistic regression
• Survival analysis & proportional hazards analysis

Linear Regression
• Outcome variable is continuous, e.g. creatinine, age, blood pressure
• Independent variables may be continuous or categorical
• Also determines the strength of association between outcome and independent variables
• Specifically allows one to predict and quantify what happens to an outcome variable for different values of independent variables.
• Simple (bivariate) regression: what value of Y would we predict given a value of X?
• Multiple regression: adjust for additional regressors (independent variables) that may also affect the outcome
• Gives the change in the outcome associated with a unit change in the independent variable
  – How much does income rise with one more year of education?
  – How much more (or less) income do males earn (as compared to females)?
Each additional year of education is associated with an increase of x in income.
E.g. for each additional year of education, annual income increases by an average of 1000 dollars.
Compared to women, men earn on average x more income.

Logistic regression
• Dichotomous outcome
• Independent variable may be categorical or continuous
• Gives odds ratios
E.g.
  Effect of weight on development of diabetes
  Effect of race on development of hypertension
As weight increases, the odds of developing diabetes increase x-fold.
Blacks are more likely to be diagnosed with hypertension than Caucasians (OR = x).

Interpretations
E.g. OR = 2 => 2-fold increased odds of …
     OR = 1.3 => 30% increased odds of …
(For rare outcomes the odds ratio approximates the relative risk, so these are often read as increased risk.)

Multiple Logistic regression
• Outcome variable is dichotomous
• Reports odds ratios
• Adjusts for other variables in the model -> reveals risk independent of the other variables in the model

Interpretation of OR
• OR of 1 means no difference in odds between the 2 groups
• OR <1 means lower odds
• OR >1 means higher odds

What does a regression do?
• Predicts a dependent variable using an independent variable
• Shows the "effect" of the independent variable on the dependent variable
  – Independent variable is also called: explanatory variable, treatment variable
  – Dependent variable is also called: response variable, outcome variable

Choosing regressors
• Select potentially relevant variables on the basis of theoretical arguments rather than statistical ones
• With purely statistical selection there is always a small probability of drawing the wrong conclusion (the probability of rejecting the null when the null is true, i.e. Type I error)

How to choose regressors?
• Include potential confounders
• Factors identified in previous studies
• Factors hypothesized to matter on substantive grounds
• DON'T include alternative measures of the outcome or of a predictor, e.g. both GFR and creatinine
• DON'T include intervening variables, e.g.
  if studying the effect of education on income, do not include occupation

Adjusted risk factors for death and graft failure

               Death                Graft Failure
               OR (95% CI)          OR (95% CI)
CMV Disease    3.44 (1.29-9.17)*    2.9 (1.26-6.65)*
Age            1.03 (1.01-1.06)*    0.99 (0.98-1.02)
Gender         1.35 (0.66-2.81)     0.96 (0.55-1.70)
Black †        0.4 (0.15-1.04)      1.76 (0.93-3.34)
Hispanic †     0.61 (0.22-1.39)     0.91 (0.40-2.10)
Live Donors    0.56 (0.27-1.15)     0.40 (0.21-0.77)*
CAD            1.35 (0.66-2.77)     1.29 (0.69-2.43)
HTN            0.45 (0.09-2.2)      0.48 (0.14-1.66)
Diabetes       0.97 (0.47-1.9)      1.23 (0.66-2.28)

Events that occur over time
• Compare outcomes that occur over time
• The outcome variable is the time until the occurrence of an event of interest.
• Kaplan-Meier analysis – use the log-rank test to assess the survival difference between two groups
• Proportional hazards analysis – comparison of event rates in two or more groups. Assumes the effect of a given risk factor is constant over the entire study period (proportionality assumption)
A risk factor is independently associated with an outcome when the effect persists after taking into account the other risk factors and confounders.
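The OR (95% CI) entries in the table above are adjusted estimates from a multivariable model, but the unadjusted versions of the same quantities can be computed by hand from a 2×2 table using the a, b, c, d layout defined earlier. The sketch below is only an illustration: the counts are hypothetical, and the confidence interval uses the common log-based (Woolf) approximation, which is an assumption here and not something the slides specify.

```python
import math


def two_by_two_measures(a, b, c, d, z=1.96):
    """Relative risk, odds ratio and an approximate 95% CI for the odds ratio.

    a, b = exposed with / without disease
    c, d = unexposed with / without disease
    Assumes all four cells are non-zero.
    """
    risk_exposed = a / (a + b)        # incidence in exposed
    risk_unexposed = c / (c + d)      # incidence in unexposed
    rr = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)    # (a/b) / (c/d)

    # Woolf (log) method: SE of ln(OR) = sqrt(1/a + 1/b + 1/c + 1/d)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return rr, odds_ratio, (ci_low, ci_high)


# Hypothetical counts: 10 of 100 treated and 30 of 100 untreated develop disease
rr, odds_ratio, (ci_low, ci_high) = two_by_two_measures(a=10, b=90, c=30, d=70)
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

With these made-up counts the interval lies entirely below 1, which is how a reader would recognize a statistically significant protective association in a table of odds ratios. An interval that includes 1 would not be significant at the 0.05 level.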
Software
[figure]

Outline
• Study Design
  – Study question
  – Choosing the study design
  – Hypothesis
  – Type I and Type II error
  – Power and sample size
  – Confounding and bias
  – Statistical significance
• Data collection and management
• Statistical Analysis
  - Univariate statistics
  - Bivariate statistics
  - Multivariate statistics
  - Predictive studies

Predictive studies
• Sensitivity
• Specificity
• Positive predictive value
• Negative predictive value
• Likelihood ratio

Sensitivity
Proportion of people with the disease who are positive on the test
= subjects with true positive results / total number of subjects with the disease
= TP/(TP+FN)

Specificity
Proportion of subjects without the disease who are negative on the test
= subjects with true negative results / total number of subjects without the disease
= TN/(TN+FP)

• No matter how sensitive the test, it does not help to rule in the diagnosis, because sensitivity does not tell you how likely it is that a positive test is a false positive
• Conversely, no matter how specific a test, it does not help to rule out the diagnosis, because specificity does not tell you how likely it is that a negative result is a false negative

Predictive values
• PPV = subjects with a true positive test / total number of subjects with positive results = TP/(TP+FP)
• NPV = subjects with a true negative test / total number of subjects with negative results = TN/(TN+FN)
Accuracy = (TP+TN) / total sample size

Likelihood Ratio
Likelihood ratio of a positive test
= (probability of a positive test in someone with the disease) / (probability of a positive test in someone without the disease)
= Sensitivity/(1-Specificity)
Likelihood ratio of a negative test
= (probability of a negative test in someone with the disease) / (probability of a negative test in someone without the disease)
= (1-Sensitivity)/Specificity

Study Design
• Observational
  - Investigator assesses a study population without altering the condition or group assignment, e.g.
    - Cross-sectional
    - Cohort (prospective or retrospective)
    - Case-control
• Experimental
  - Investigator manipulates the condition or group assignment. Typically one group receives a treatment and the other group receives a different treatment or a placebo.

Cross-sectional
• Easy to conduct
• Fast
• Takes a snapshot of the study population
• Used to answer descriptive questions, e.g. prevalence of a disease
• Not good at answering analytic questions: cause and effect cannot be established
Helpful to determine prevalence, which is used in estimating sample size for analytic studies.

Cohort Studies
Prospective cohort
The study population is assembled prior to development of an outcome and followed over time.
At entry, subjects are assessed for exposures of interest and evaluated to make sure they do not already have the outcome being studied.
E.g. if you want to follow individuals prospectively to see whether they develop CKD and to assess risk factors for CKD, it doesn't make sense to include CKD patients in the cohort.
• Provide stronger evidence for a causal relationship and help to exclude reverse causality
• Eliminate recall bias
• Determine incidence

Cohort studies cont.
Famous prospective cohort: Framingham Heart Study
Disadvantages – take too long, especially if the disease develops slowly; costly; inefficient for studying rare diseases.
Retrospective cohort – the outcome has already occurred and you go back to earlier exposures (recall bias)

Case-control studies
• Subjects are assembled based on whether they have experienced the outcome (cases) or not (controls).
• Once cases and controls are identified, the frequency of risk factors is compared between the groups.
• Advantage: efficient, especially for studying rare diseases
• Recall bias is a problem
• Selection bias – if the disease is deadly, most of the cases are dead and the available "cases" may not be fully representative

Type I error (aka alpha error)
• Rejecting the null hypothesis when it is true.
  Null Hypothesis (H0) – no difference/no association - TRUE
  Alternative Hypothesis (HA) – difference/association exists - FALSE
• The probability of a type I error is the level of significance of the test of hypothesis
• Denoted by alpha (α)
• Mostly set at 0.05

Statistical significance
• Alpha of .05 (i.e., a 95% probability of not declaring an impact when in fact there is no impact)
• Or, loosely: "if there were truly no difference, findings at least this extreme would arise by chance alone <5% of the time"
• When the p-value is <0.05 (below alpha) we reject the null hypothesis
• Rejecting the null => implicit evidence for the alternative hypothesis
• There is a statistically significant difference!

Type II error
• Occurs when one fails to reject the null hypothesis when it is actually false (i.e. the alternative hypothesis is true)
  Null Hypothesis (H0) – no difference/no association - FALSE
  Alternative Hypothesis (HA) – difference/association exists - TRUE
• Conclude there is no difference when one exists
• The probability of a type II error is denoted by beta (β)

Power
• Probability of rejecting the null when the alternative hypothesis (HA) is true
  Null Hypothesis (H0) – no difference/no association - FALSE
  Alternative Hypothesis (HA) – difference/association exists - TRUE
• Alternative phrasing: "detecting an impact, when in fact there is an impact"
• Power = 1 - Prob(Type II error)
• Power = 1 - β
• Power of .80 is desirable (i.e., an 80% probability of detecting an impact, when in fact there is an impact)

               Type I error   Type II error    Power
               α              β                1-β
H0             True           False            False
HA             False          True             True
Action on H0   Reject         Fail to reject   Reject
Probability    0.05           0.2              0.8

Sampling
Why sample?
Not feasible to study the entire population
Uses of sampling
• Make generalizations about the population of interest
• A large population of interest can be studied efficiently and accurately through examination of a carefully selected subset or sample
• Obtain information at an acceptable cost
The sample ideally is representative of the population under study.

Confidence Interval
• Confidence interval = best guess (point estimate) plus or minus about 2 standard errors
• Provides a range or boundary around a sample estimate
• A range constructed this way will contain the truth 95% of the time.

Sample Size calculation
Depends on the type of study design, the type of variable and the analysis planned.
E.g. for a bivariate analysis comparing 2 means you will need
  - Alpha level (Type I error), mostly 0.05
  - Power (1 - Type II error), mostly 80%
  - Effect size (anticipated difference between the two means)
  - Standard deviation of the interval variable (estimated SD of the variable)
* If you already have a sample, the power can be calculated instead
Software!

Things to consider when planning a study
• Confounding
• Bias
• Generalizability
• Length of time to conduct
• Minimize expense
• Addressing a broader range of questions

Bias
• Systematic error
• Can occur with both randomized and observational studies
• Can occur at all stages of the study
  – Selection of subjects, e.g. investigator steering subjects into one group.
  – Measurement, e.g. investigator and subjects forming an expectation based on the subjects' assigned group that alters their assessment of improvement
  – Follow-up, e.g. subjects assigned to the placebo group may drop out of the study.
In a study investigating whether angioplasty is better than medical management for renal artery stenosis, if patients with poor baseline renal function are preferentially assigned to the angioplasty group, the results of the study may be biased against angioplasty.
How do you eliminate bias?

Strategy to eliminate bias
• Group assignment should be done at the time of enrollment, using a random number table or generator, by someone who has no contact with the participant.
• Blinding & double blinding
  But not always possible, and sometimes an ethical problem, e.g. a sham procedure (giving IV contrast without angioplasty) to compare PTA vs medical management for RAS.

Generalizability
• Ability to apply the results of a study to a population other than the study sample.
• More of a problem with randomized studies.
  – Conditions of randomized subjects are different from the conditions of clinical practice: they are given more attention and monitoring
  – Randomized subjects are by definition different from the general population: they volunteer and are willing to undergo frequent exams, blood draws, etc.
  Here lies the difference between efficacy and effectiveness of a treatment!
• Observational studies more closely approximate treatment effectiveness, but still
  – Participants in observational studies may receive more attention than in standard clinical care.
  – Merely observing participants changes their behavior: the Hawthorne effect!
Increasing generalizability requires appropriate sampling to ensure the sample is representative of the population under study.

Length of time to conduct
• Observational studies may be faster to conduct if you have an existing database or can use a case-control design.
• But cohort studies (even observational) take long, sometimes more than one's lifetime

Expense
• Randomized – expensive!
• Even more expensive than an observational cohort: in an observational study the investigators are merely observing, so the cost of interventions is not paid by the study, whereas in a trial the cost of all interventions (drugs, surgery, procedures, tests) is covered by the study.

Addressing a broader range of questions
• Observational studies are able to answer a broader range of questions than randomized studies
• E.g. it would be unethical to randomize people to smoke, but you can observe outcomes in smokers.
  Propensity score adjustment can be done to account for differences in covariates (independent variables) between smokers and non-smokers.
• Randomized studies are not helpful in identifying causes of disease outbreaks or food-borne illnesses.

Specific advantages of randomized and observational studies

                                          Randomized   Observational
Eliminating confounding                       X
Minimizing bias                               X
Increasing generalizability                                  X
Speed in conducting study                                    X
Minimizing expense                                           X
Addressing a broader range of questions                      X

Reserve the use of observational studies for instances where it is unethical or infeasible to perform randomized controlled trials, or when time is of the essence in obtaining a result.
