Short overview of statistical methods
Hein Stigum
Presentation, data and programs at: http://folk.uio.no/heins/ courses
Apr-20 H.S.

Agenda
• Concepts
• Bivariate analysis
  – Continuous symmetrical
  – Continuous skewed
  – Categorical
• Multivariable analysis
  – Linear regression
  – Logistic regression
Outcome variable decides analysis

CONCEPTS

Precision and bias
• Measures of populations
  – precision - random error - statistics
  – bias - systematic error - epidemiology
[Figure: estimate versus true value, with precision and bias marked]

Precision: Estimation
[Figure: population with true value; sample with estimate]
Estimate with confidence interval: ( | )
95% confidence interval: 95% of repeated intervals will contain the true value

Precision: Testing
[Figure: population with true values for group 1 and group 2; sample with estimate 1 and estimate 2]
p-value = P(observing this difference or more, when the true difference is zero)

Precision: Significance level
Birth weight, 500 newborn, observe difference
Significance level: p<0.05

  Observed difference   p-value   Conclusion
  10 gr                 p=0.90    H0: boys=girls
  50 gr                 p=0.40    H0: boys=girls
  100 gr                p=0.10    H0: boys=girls
  130 gr                p=0.04    Ha: boys≠girls
  150 gr                p=0.02    Ha: boys≠girls

Precision: Test situations
• 1 sample test
  – Weight = 10
• 2 independent samples
  – Weight by sex
• K independent samples
  – Weight by age groups
• 2 dependent samples
  – Weight last year = Weight today

Bias: DAGs
[DAG: E = gest age, D = birth weight, C1 = sex, C2 = parity]
• Associations: bivariate (unadjusted)
• Causal effects: multivariable (adjusted)
Draw your assumptions before your conclusions

WHY USE GRAPHS?

Problem example
• Lunch meals per week
  – Table of means (around 5 per week)
  – Linear regression
[Figure: histogram, percent by lunch meals per week (1 to 7)]

Problem example 2
• Iron level by sex
  – Both linear and logistic regression
  – Opposite results
[Figure: density of iron level in blood for girls and boys, with the group means marked]
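The interpretation of the 95% confidence interval under "Precision: Estimation" can be checked by simulation. The following is a minimal Python sketch (the course itself uses Stata), not part of the original material. The true mean, standard deviation, sample size, number of repeats and seed are made-up values, and the interval uses the normal approximation mean ± 1.96*se.

```python
import random
import statistics


def coverage(true_mean=3500.0, sd=500.0, n=100, repeats=2000, seed=1):
    """Share of repeated 95% intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repeats):
        # Draw one sample from the (made-up) population
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        # Does this sample's interval cover the true value?
        if mean - 1.96 * se <= true_mean <= mean + 1.96 * se:
            hits += 1
    return hits / repeats


print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

The printed share should land close to 0.95. It is a property of the procedure over many samples, not of any single interval.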
Datatypes
• Categorical data
  – Nominal: married/ single/ divorced
  – Ordinal: small/ medium/ large
• Numerical data
  – Discrete: number of children
  – Continuous: weight

Outcome data type dictates type of analysis
• Data type: Numerical
  – Normal data? Yes: Means, T-test, Linear regression
  – Normal data? No: Medians, Non-par tests
• Data type: Categorical
  – Freq table, Cross, Chi-square, Logistic regression

BIVARIATE ANALYSIS 1
Continuous symmetric outcome: Birth weight

Distribution
  drop if weight<2000
  kdensity weight
[Figure: kernel density of weight, one panel with x-axis 0 to 6000 and one with x-axis 2000 to 6000]

Central tendency and dispersion
• Mean and standard deviation:
• Mean with confidence interval:

Compare groups, equal variance?
• Equal
• Not equal
[Figure: one plot for each case]

2 independent samples
Are birth weights the same for boys and girls?
[Figure: scatterplot of birth weight by sex (boys, girls); density plot of birth weight]

2 independent samples test
  ttest weight, by(sex) unequal      unequal variances
  ttest var1==var2                   paired test

K independent samples
• Is birth weight the same over parity?
[Figure: scatterplot of birth weight by parity (0, 1, 2-7); density plot of birth weight by parity]

K independent samples test
• Equal means?
• Equal variances?

Continuous by continuous
• Does birth weight depend on gestational age?
[Figure: scatterplot of birth weight versus gestational age; same scatterplot with the outlier dropped]

Continuous by continuous tests
• Cut gestational age up in groups, then use T-test or ANOVA
or
• Use linear regression with 1 covariate
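The second option, linear regression with one covariate, can be sketched with the closed-form least-squares formulas. This Python sketch is an illustration only (the slides use Stata); the function name `fit_line` and the data values are made up and are not the course data.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariation of x and y divided by variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up example: gestational age (days) and birth weight (gram)
gest = [260, 270, 275, 280, 285, 290]
weight = [3300, 3420, 3400, 3550, 3600, 3580]

b0, b1 = fit_line(gest, weight)
print(f"slope: {b1:.1f} gram per day of gestational age")
```

The slope answers the question on the slide directly: it is the estimated change in birth weight per extra day of gestational age, without having to cut gestational age into groups.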
Test situations
• 1 sample test
  – ttest weight == 10
• 2 independent samples
  – ttest weight, by(sex)
• K independent samples
  – oneway weight parity
• 2 dependent samples (Paired)
  – ttest weight_last_year == weight_today

BIVARIATE ANALYSIS 2
Continuous skewed outcome: Number of sexual partners

Distribution
  kdensity partners if partners<=50
[Figure: Distribution of number of lifetime partners, N=394; percentiles 25%, 50%, 75% and 95% marked on the Partners axis (ticks 1, 4, 9, 20, 50)]

Central tendency and dispersion
• Median and percentiles:

2 independent samples
Do males and females have the same number of partners?
[Figure: scatterplot of partners by gender (males, females); density plot of partners]

2 independent samples test
• Equal medians?

K independent samples
Do partners vary with age?
[Figure: scatterplot of partners by agegr3 (18-29, 30-44, 45-60); density plot of partners by age group]

K independent samples test
• Equal medians?

Table of descriptives

                              Numerical data                       Proportions
                              Normal               Skewed
  Descriptives
    Center                    Mean                 Median          p
    Dispersion                Standard deviation   Fractiles
  Confidence intervals for center estimates
    Standard error            se(mean)                             se(p)
    95% confidence interval   mean ± 2*se(mean)                    p ± 2*se(p)

Table of tests

                            Numerical data                                      Proportions
                            Normal                      Skewed
  1 sample                  One sample T-test           Kolmogorov-Smirnov        Binomial
  2 independent samples     Independent sample T-test   Mann-Whitney U            Chi-square
  K independent samples     ANOVA                       Kruskal-Wallis            Chi-square
  2 dependent samples       Paired sample T-test        Wilcoxon signed rank test McNemar (2x2)

Remarks:
• If unequal variance in ANOVA: use linear regression with robust variance estimation
• If N is large: may use parametric tests
• Categorical ordered: use nonparametric tests

BIVARIATE ANALYSIS 3
Categorical outcome: Being bullied

Frequency and proportion
• Frequency:
• Proportion with CI:
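A proportion with its approximate 95% confidence interval, p ± 2*se(p) as in the table of descriptives, can be computed by hand. The Python sketch below assumes the usual standard error sqrt(p(1-p)/n); the function name `proportion_ci` and the counts (20 cases out of 100) are made up for illustration.

```python
import math


def proportion_ci(x, n, z=2.0):
    """Proportion p = x/n, its standard error, and the interval p ± z*se(p)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)


# Made-up example: 20 cases among 100 persons
p, se, (low, high) = proportion_ci(20, 100)
print(f"p = {p:.2f}, se = {se:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The multiplier 2 is the rounded value of 1.96 used on the slides. The approximation works best when n is large and p is not close to 0 or 1.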
Proportion, confidence interval
x = "disease", n = total number
• proportion: $p = \frac{x}{n}$
• standard error: $se(p) = \sqrt{\frac{p(1-p)}{n}}$
• confidence interval: $CI(p) = p \pm 2\,se(p)$

Crosstables
Are boys bullied as much as girls?
• Equal proportions?

Ordered categories, trend
• Trend?
• Equal proportions?

MULTIVARIABLE ANALYSIS 1
Continuous outcome: Linear regression, Birth weight

Regression idea
model: $y = b_0 + b_1 x + e$
  y = outcome
  x = covariate
  $b_1$ = coefficient, effect of x
  e = error, residual
model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
  $x_1, x_2$ = covariates
[Figure: scatterplot with fitted line, x-axis gestational age (days) from 250 to 310, y-axis from 2500 to 5000]

Model and assumptions
• Model
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
• Association measure
  $\beta_1$ = increase in y for one unit increase in $x_1$
• Assumptions
  – Independent errors
  – Linear effects
  – Constant error variance
• Robustness
  – influence

Workflow
• DAG
• Scatterplots
• Bivariate analysis
• Regression
  – Model estimation
  – Test of assumptions
    • Independent errors
    • Linear effects
    • Constant error variance
  – Robustness
    • Influence
[DAG: E = gest age, D = birth weight, C1 = sex, C2 = parity]
[Figure: scatterplot of birth weight (gram) versus gestational age (days), observation 539 labelled]
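The model-estimation step for a model with two covariates, y = b0 + b1*x1 + b2*x2 + e, can be sketched by solving the least-squares normal equations. This Python sketch is an illustration under stated assumptions, not the course's Stata workflow: the helper names `solve` and `fit` are hypothetical, and the data are made up and noise-free, so the fit returns exactly the coefficients used to generate them.

```python
def solve(a, b):
    """Solve the linear system a*z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit(x1, x2, y):
    """Least-squares b0, b1, b2 for y = b0 + b1*x1 + b2*x2 + e."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]  # design matrix with constant
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


# Made-up, noise-free data: gestational age minus 280 days, and girl = 1
gest_c = [-20, -10, 0, 10, 20, -15, 5, 15]
girl = [0, 1, 0, 1, 0, 1, 1, 0]
weight = [3500 + 6 * g - 140 * s for g, s in zip(gest_c, girl)]

b0, b1, b2 = fit(gest_c, girl, weight)
print(f"b0 = {b0:.1f}, b1 = {b1:.2f} per day, b2 = {b2:.1f} for girls")
```

Here b1 is read as the increase in y for one unit increase in x1 with x2 held fixed, which is what makes it an adjusted estimate.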
Categorical covariates
• 2 categories
  – OK
• 3+ categories
  – Use "dummies"
• "Dummies" are 0/1 variables used to create contrasts
• Want 3 categories for parity: 0, 1 and 2-7 children
• Choose 0 as reference
• Make dummies for the two other categories
  generate Parity1   = (parity==1) if parity<.
  generate Parity2_7 = (parity>=2) if parity<.

Create meaningful constant
Expected birth weight:
  $E(y) = \beta_0 + \beta_1\,gest + \beta_2\,sex + \beta_3\,Parity1 + \beta_4\,Parity2\_7$
Expected birth weight at:
• gest=0, sex=0, parity=0: $\beta_0 = 1925$ gr, not meaningful
• gest=280, sex=1, parity=0: $\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1 = 3524$ gr

Model estimation

                              coeff     95% conf. int.
  Birth weight at ref         3524.3
  Gestational age per day        6.0    (3.9 , 8.2)
  Sex
    Boy                          0
    Girl                      -139.2    (-228.9 , -49.5)
  Parity
    0                            0
    1                          232.0    (130.6 , 333.5)
    2-7                        226.0    (106.9 , 345)

Test of assumptions
• Plot residuals versus predicted y
  – Independent residuals?
  – Linear effects?
  – Constant variance?
[Figure: residuals versus linear prediction; outlier not included]

Violations of assumptions
• Dependent residuals
  – Use mixed models or GEE
• Non linear effects
  – Add square term
• Non-constant variance
  – Use robust variance estimation
[Figure: residuals versus gest; residuals (res) versus prediction (p)]

Influence
[Figure: birth weight versus gestational age with the outlier marked; regression with outlier and regression without outlier]

Measures of influence
Remove obs 1, see change
Remove obs 2, see change
...
• Measure change in:
  – Predicted outcome
  – Deviance
  – Coefficients (beta)
    • Delta beta
[Figure: change by observation Id]

Delta beta for gestational age
[Figure: delta beta for gestational age plotted against weight; observation 539 stands out]
beta for gestational age = 6.04
If obs nr 539 is removed, beta will change from 6 to 16

Removing outlier

                              Full model                   Outlier removed (final model)
                              coeff   95% conf. int.       coeff   95% conf. int.
  Birth weight at ref         3524                         3531
  Gestational age per day        6    (4 , 8)                17    (13 , 20)
  Sex
    Boy                          0                            0
    Girl                      -139    (-229 , -49)         -166    (-252 , -80)
  Parity
    0                            0                            0
    1                          232    (131 , 333)           229    (132 , 326)
    2-7                        226    (107 , 345)           225    (112 , 339)

One outlier affected two estimates

MULTIVARIABLE ANALYSIS 2
Binary outcome: Logistic regression, Being bullied

Ordered categories and model
Interval versus ordered scale:
• Interval scale: 1, 2, 3
• Ordered scale: low, medium, high

  Categories   Regression model
  2            Logistic
  3-7          Ordinal logistic
  >7           Linear (treat as interval)

Logistic model and assumptions
• Association measure
  $OR_1 = e^{\beta_1}$: odds ratio in y for 1 unit increase in $x_1$
• Assumptions
  – Independent errors
  – Linear effects on the log odds scale
• Robustness
  – influence

Being bullied
• We want the total effect of country on being bullied.
  – The risk of being bullied depends on age and sex.
  – The age and sex distribution may differ between countries.
• Should we adjust for age and sex?
  No, age and sex are mediating variables
[DAG: E = country, D = bullied, C1 = age, C2 = sex]

Logistic: being bullied

  Country    N     % bullied   OR    95% conf. int.
  Sweden     407    8.7        1
  Iceland    448   10.9        1.3   (0.8 , 2)
  Norway     379   16.2        2.0   (1.3 , 3.2)
  Finland    409   25.9        3.7   (2.4 , 5.6)
  Denmark    436   23.4        3.2   (2.1 , 4.9)
  p-value for country: <0.001

Roughly:
• Same risk of being bullied in Iceland as in Sweden.
• 2 times the risk in Norway as in Sweden.
• 3 times the risk in Finland as in Sweden.
Prevalence of being bullied = 17%
• OR≈RR if the outcome is rare
• OR>RR (further from 1) if the outcome is common

Summing up
• DAGs
  – State prior knowledge.
  – Guide analysis
• Plots
  – Linearity, variance, outliers
• Bivariate analysis
  – Continuous symmetrical: Mean, T-test, anova
  – Continuous skewed: Median, nonparametric
  – Categorical: Freq, cross, chi-square
• Multivariable analysis
  – Continuous: Linear regression
  – Binary: Logistic regression
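The odds ratios in the "Logistic: being bullied" table can be approximately reproduced from the reported percentages, which also shows why OR lies further from 1 than RR when the outcome is common. The Python sketch below assumes the reported ORs are unadjusted and uses Sweden as reference; the helper names `odds` and `compare` are hypothetical.

```python
def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1 - p)


# Percent bullied by country, taken from the table in the slides
percent_bullied = {
    "Sweden": 8.7,
    "Iceland": 10.9,
    "Norway": 16.2,
    "Finland": 25.9,
    "Denmark": 23.4,
}
ref = percent_bullied["Sweden"] / 100  # reference category


def compare(country):
    """Odds ratio and risk ratio for a country versus Sweden."""
    p = percent_bullied[country] / 100
    return odds(p) / odds(ref), p / ref


for country in percent_bullied:
    odds_ratio, risk_ratio = compare(country)
    print(f"{country:8s} OR = {odds_ratio:.1f}   RR = {risk_ratio:.1f}")
```

For Finland the odds ratio rounds to 3.7 while the risk ratio rounds to 3.0, matching the table and the "3 times the risk" reading on the slide.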