Short overview of statistical methods
Hein Stigum
Presentation, data and programs at:
http://folk.uio.no/heins/
courses
Apr-20
H.S.
1
Agenda
• Concepts
• Bivariate analysis
– Continuous symmetrical
– Continuous skewed
– Categorical
• Multivariable analysis
– Linear regression
– Logistic regression
Outcome variable decides analysis
CONCEPTS
Precision and bias
• Measures of populations
– precision - random error - statistics
– bias - systematic error - epidemiology
[Figure: precision and bias of an estimate relative to the true value]
Precision: Estimation
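[Figure: a sample is drawn from the population; the estimate from the sample, shown with its confidence interval, approximates the true value in the population]
95% confidence interval: if the study were repeated many times, 95% of the resulting intervals would contain the true value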
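The repeated-sampling reading of a confidence interval can be checked by simulation. The sketch below is written in Python for illustration only (the course itself uses Stata); the true mean, standard deviation, sample size and seed are made-up assumptions, and the interval uses the normal factor 1.96.

```python
import random
import statistics

def coverage(n_repeats=2000, n=50, true_mean=3500.0, sd=500.0, seed=1):
    """Fraction of repeated 95% intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_repeats):
        # Draw one sample from a normal population (assumed values).
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lower, upper = mean - 1.96 * se, mean + 1.96 * se
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_repeats

print(f"share of intervals containing the true value: {coverage():.3f}")
```

The printed share should land close to 0.95; it is a property of the procedure over many repetitions, not of any single interval.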
Precision: Testing
[Figure: two groups are sampled from the population; estimate 1 and estimate 2 approximate the true values in group 1 and group 2]
p-value = P(observing this difference or more, when the true difference is zero)
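The definition of the p-value can be made concrete with a permutation test: shuffle the group labels (which makes the true difference zero) and count how often the shuffled difference is at least as large as the observed one. This is a minimal Python sketch, not the test used in the course; the birth weights below are invented for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group1, group2, n_perm=5000, seed=1):
    """Two-sided p-value for a difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling: true difference is zero
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical birth weights in grams (not from the course data).
boys = [3600, 3450, 3800, 3550, 3700, 3650]
girls = [3400, 3500, 3300, 3600, 3450, 3350]
print(f"p-value: {permutation_p_value(boys, girls):.3f}")
```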
Precision: Significance level
Birth weight, 500 newborns, observed difference between boys and girls
H0: boys=girls
Ha: boys≠girls

Observed difference    p-value
10 gr                  p=0.90
50 gr                  p=0.40
100 gr                 p=0.10
130 gr                 p=0.04
150 gr                 p=0.02

Significance level: p<0.05 (here reached at differences of 130 gr and 150 gr)
Precision: Test situations
• 1 sample test
• Weight =10
• 2 independent samples
• Weight by sex
• K independent samples
• Weight by age groups
• 2 dependent samples
• Weight last year = Weight today
Bias: DAGs
[Figure: DAG with exposure E = gest age, outcome D = birth weight, and covariates C1 = sex and C2 = parity]
Associations: bivariate (unadjusted) analysis
Causal effects: multivariable (adjusted) analysis
Draw your assumptions before your conclusions
WHY USE GRAPHS?
Problem example
• Lunch meals per week
– Table of means (around 5 per week)
– Linear regression
[Figure: histogram of lunch meals per week (x-axis 1 to 7, y-axis percent, 0 to 50)]
Problem example 2
• Iron level by sex
– Both linear and logistic regression
– Opposite results
[Figure: density of iron level in blood, with the mean for girls and the mean for boys marked]
Data types
• Categorical data
– Nominal: married/ single/ divorced
– Ordinal: small/ medium/ large
• Numerical data
– Discrete: number of children
– Continuous: weight
Outcome data type dictates type of analysis
Data type?
• Numerical: normal data?
– Yes: Means, T-test, Linear regression
– No: Medians, nonparametric tests
• Categorical: Freq table, Cross, Chi-square, Logistic regression
Continuous symmetric outcome: Birth weight
BIVARIATE ANALYSIS 1
Distribution
drop if weight<2000
kdensity weight
[Figure: kernel density plots of weight, x-axis 0 to 6000 and x-axis 2000 to 6000]
Central tendency and dispersion
Mean and standard deviation:
Mean with confidence interval:
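As a rough illustration of these two summaries, the Python sketch below computes the mean, the standard deviation and a normal-approximation 95% confidence interval for the mean. The weights are invented, and the factor 1.96 is an assumption; for small samples a t-based factor would be slightly wider.

```python
import statistics

def mean_with_ci(values, z=1.96):
    """Return mean, standard deviation and an approximate 95% CI for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    se = sd / n ** 0.5              # standard error of the mean
    return mean, sd, (mean - z * se, mean + z * se)

# Hypothetical birth weights in grams.
weights = [3200, 3550, 3900, 3400, 3650, 3100, 3800, 3500]
mean, sd, (lower, upper) = mean_with_ci(weights)
print(f"mean={mean:.1f} sd={sd:.1f} 95% CI=({lower:.1f}, {upper:.1f})")
```

The standard deviation describes the spread of the individual observations, while the confidence interval describes the uncertainty of the mean itself.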
Compare groups, equal variance?
• Equal
• Not equal
[Figure: group distributions compared, one pair with equal variance and one pair with unequal variance]
2 independent samples
Are birth weights the same for boys and girls?
[Figure: density plot and scatterplot of birth weight (2000 to 6000) by sex (boys, girls)]
2 independent samples test
ttest weight, by(sex) unequal     (unequal variances)
ttest var1==var2                  (paired test)
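To show what the unequal-variances option changes, here is a minimal Python sketch of the Welch test statistic and its Welch-Satterthwaite degrees of freedom. It is an illustration under assumed, invented data, not the course's Stata output, and it stops short of a p-value because the t distribution is not in the Python standard library.

```python
import statistics

def welch_t(group1, group2):
    """t statistic and approximate degrees of freedom without assuming equal variances."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (statistics.mean(group1) - statistics.mean(group2)) / se2 ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical birth weights in grams.
boys = [3600, 3450, 3800, 3550, 3700, 3650]
girls = [3400, 3500, 3300, 3600, 3450, 3350]
t, df = welch_t(boys, girls)
print(f"t={t:.2f} df={df:.1f}")
```

With equal variances and equal group sizes the degrees of freedom approach n1 + n2 - 2; the more the variances differ, the smaller they become.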
K independent samples
• Is birth weight the same over parity?
[Figure: density plot and scatterplot of birth weight (2000 to 6000) by parity (0, 1, 2-7)]
K independent samples test
equal means?
Equal variances?
Continuous by continuous
• Does birth weight depend on gestational age?
[Figure: scatterplot of birth weight against gestational age (200 to 700), and the same scatterplot with the outlier dropped (gestational age 200 to 300)]
Continuous by continuous tests
• Cut gestational age into groups,
then use T-test or ANOVA
or
• Use linear regression with 1 covariate
Test situations
• 1 sample test
• ttest weight == 10
• 2 independent samples
• ttest weight, by(sex)
• K independent samples
• oneway weight parity
• 2 dependent samples (Paired)
• ttest weight_last_year == weight_today
Continuous skewed outcome: Number of sexual partners
BIVARIATE ANALYSIS 2
Distribution
kdensity partners if partners<=50
[Figure: distribution of number of lifetime partners (x-axis up to 50), N=394, with percentiles marked: 25% = 1, 50% = 4, 75% = 9, 95% = 20]
Central tendency and dispersion
Median and percentiles:
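For a skewed variable the median and percentiles describe the data better than the mean. The Python sketch below uses a small invented list of partner counts (not the course data) to show how one large value pulls the mean upward while the median stays put.

```python
import statistics

# Hypothetical number of lifetime partners, deliberately right-skewed.
partners = [0, 1, 1, 2, 3, 4, 4, 5, 7, 9, 12, 20, 45]

median = statistics.median(partners)
# Quartiles by linear interpolation between the observed values.
q1, q2, q3 = statistics.quantiles(partners, n=4, method="inclusive")
mean = statistics.mean(partners)

print(f"median={median} quartiles=({q1}, {q2}, {q3}) mean={mean:.1f}")
```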
2 independent samples
Do males and females have the same number of partners?
[Figure: density plot and scatterplot of number of partners by gender (males, females)]
2 independent samples test
equal medians?
K independent samples
Do partners vary with age?
[Figure: density plot and scatterplot of number of partners by age group (agegr3: 18-29, 30-44, 45-60)]
K independent samples test
equal medians?
Table of descriptives
                          Numerical data                          Proportions
                          Normal               Skewed
Descriptives
  Center                  Mean                 Median             p
  Dispersion              Standard deviation   Fractiles
Confidence intervals for center estimates
  Standard error          se(mean)                                se(p)
  95% Confidence interval mean ± 2*se(mean)                       p ± 2*se(p)
Table of tests
                        Numerical data                                          Proportions
                        Normal                      Skewed
1 sample                One sample T-test           Kolmogorov-Smirnov          Binomial
2 independent samples   Independent sample T-test   Mann-Whitney U              Chi-square
K independent samples   ANOVA                       Kruskal-Wallis              Chi-square
2 dependent samples     Paired sample T-test        Wilcoxon signed rank test   McNemar (2x2)
Remarks:
• Normal: if unequal variance in ANOVA, use linear regression with robust variance estimation
• Skewed: if N is large, may use parametric tests
• Proportions: if categorical ordered, use nonparametric tests
Categorical outcome: Being bullied
BIVARIATE ANALYSIS 3
Frequency and proportion
Frequency:
Proportion with CI:
Proportion, confidence interval
proportion: $p = \frac{x}{n}$, where x = "disease" and n = total number
standard error: $se(p) = \sqrt{\frac{p(1-p)}{n}}$
confidence interval: $CI(p) = p \pm 2 \cdot se(p)$
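The three formulas translate directly into a few lines of code. The Python sketch below follows the slide's factor of 2 for the interval; the counts in the example (17 cases out of 100) are made up for illustration.

```python
def proportion_ci(x, n, z=2.0):
    """Proportion, standard error and approximate 95% CI, using p ± 2*se(p)."""
    p = x / n
    se = (p * (1 - p) / n) ** 0.5
    return p, se, (p - z * se, p + z * se)

# Hypothetical example: 17 of 100 children report being bullied.
p, se, (lower, upper) = proportion_ci(17, 100)
print(f"p={p:.3f} se={se:.4f} CI=({lower:.3f}, {upper:.3f})")
```

This simple interval works best when n is large and p is not close to 0 or 1.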
Crosstables
Are boys bullied as much as girls?
equal proportions?
Ordered categories, trend
Trend?
equal proportions?
Continuous outcome: Linear regression, Birth weight
MULTIVARIABLE ANALYSIS 1
Regression idea
[Figure: outcome (2500 to 5000) plotted against gestational age (days, 250 to 310)]
Model: $y = b_0 + b_1 x + e$
y = outcome
x = covariate
$b_1$ = coefficient, effect of x
e = error, residual
Model with many covariates: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
$x_1, x_2$ = covariates
Model and assumptions
• Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
• Association measure
$\beta_1$ = increase in y for one unit increase in $x_1$
• Assumptions
– Independent errors
– Linear effects
– Constant error variance
• Robustness
– influence
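To make the association measure concrete, the Python sketch below fits the simplest case, one covariate, by ordinary least squares. The five data points are invented; the fitted slope is then the estimated increase in y (birth weight) per one unit (day) increase in x.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: gestational age in days, birth weight in grams.
gest = [260, 270, 280, 290, 300]
weight = [3200, 3300, 3500, 3550, 3700]
b0, b1 = fit_line(gest, weight)
print(f"b0={b0:.1f} b1={b1:.2f} grams per day")
```

With several covariates the same idea applies, but each coefficient is then the effect of its covariate with the others held fixed.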
Workflow
• DAG
• Scatterplots
• Bivariate analysis
• Regression
– Model estimation
– Test of assumptions
• Independent errors
• Linear effects
• Constant error variance
– Robustness
• Influence
[Figure: DAG with E = gest age, D = birth weight, C1 = sex, C2 = parity]
[Figure: scatterplot of birth weight (gram) against gestational age (days), with observation 539 marked]
Categorical covariates
• 2 categories
– OK
• 3+ categories
– Use "dummies"
• "Dummies" are 0/1 variables used to create contrasts
• Want 3 categories for parity: 0, 1 and 2-7 children
• Choose 0 as reference
• Make dummies for the two other categories
generate Parity1 = (parity==1) if parity<.
generate Parity2_7 = (parity>=2) if parity<.
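The dummy coding can be mirrored outside Stata as well. This is a minimal Python sketch in which None plays the role of a missing value; the function name is hypothetical and chosen only for this illustration.

```python
def parity_dummies(parity):
    """Return (Parity1, Parity2_7) with parity 0 as the reference category.

    Missing parity (None) gives missing dummies, like `if parity<.` in Stata.
    """
    if parity is None:
        return None, None
    return int(parity == 1), int(parity >= 2)

for value in [0, 1, 2, 5, None]:
    print(value, parity_dummies(value))
```

The reference category (parity 0) is the one where both dummies are 0, so each coefficient is a contrast against first-born children.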
Create meaningful constant
Expected birth weight = $E(y) = \beta_0 + \beta_1 \cdot \text{gest} + \beta_2 \cdot \text{sex} + \beta_3 \cdot \text{Parity1} + \beta_4 \cdot \text{Parity2\_7}$
Expected birth weight at:
gest=0, sex=0, parity=0: $\beta_0 = 1925$ gr, not meaningful
gest=280, sex=1, parity=0: $\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1 = 3524$ gr
Model estimation
                        coeff     95% conf. int.
Birth weight at ref     3524.3
Gestational age
  per day               6.0       (3.9 , 8.2)
Sex
  Boy                   0
  Girl                  -139.2    (-228.9 , -49.5)
Parity
  0                     0
  1                     232.0     (130.6 , 333.5)
  2-7                   226.0     (106.9 , 345)
Test of assumptions
• Plot residuals versus predicted y
– Independent residuals?
– Linear effects?
– Constant variance?
[Figure: residuals (-1000 to 1500) plotted against linear prediction (3200 to 4000); outlier not included]
Violations of assumptions
• Dependent residuals
Use mixed models or GEE
• Non-linear effects
Add square term
• Non-constant variance
Use robust variance estimation
[Figure: residual plots illustrating the violations, including residuals against gest (200 to 300) and residuals against predicted value (3400 to 3800)]
Influence
[Figure: scatterplot of birth weight (2000 to 6000) against gestational age (200 to 700) with the outlier marked, showing the regression line with the outlier and the regression line without the outlier]
Measures of influence
• Measure change in:
– Predicted outcome
– Deviance
– Coefficients (beta)
• Delta beta
[Figure: change plotted against observation Id: remove obs 1, see change; remove obs 2, see change; and so on]
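The "remove one observation, see change" idea can be sketched as a leave-one-out loop. The Python code below computes the raw change in the slope (full-data slope minus slope without the observation) on invented data with one extreme gestational age; statistical packages often report a standardized version of this quantity, so the numbers are not directly comparable to their output.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def delta_betas(x, y):
    """Change in slope caused by each observation: full slope minus leave-one-out slope."""
    full = slope(x, y)
    return [full - slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]

# Hypothetical data; the last observation has an implausible gestational age.
gest = [260, 270, 280, 290, 300, 700]
weight = [3200, 3300, 3500, 3550, 3700, 3400]
for obs, change in enumerate(delta_betas(gest, weight), start=1):
    print(f"obs {obs}: delta beta = {change:.2f}")
```

A large negative delta beta means the observation drags the slope down, so removing it makes the slope larger.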
[Figure: delta beta for gestational age (-10 to 0) plotted against weight (2000 to 6000); observation 539 stands out]
beta for gestational age = 6.04
If obs nr 539 is removed, beta will change from 6 to 16
Removing outlier
Full model
                        coeff    95% conf. int.
Birth weight at ref     3524
Gestational age
  per day               6        (4 , 8)
Sex
  Boy                   0
  Girl                  -139     (-229 , -49)
Parity
  0                     0
  1                     232      (131 , 333)
  2-7                   226      (107 , 345)

Outlier removed (final model)
                        coeff    95% conf. int.
Birth weight at ref     3531
Gestational age
  per day               17       (13 , 20)
Sex
  Boy                   0
  Girl                  -166     (-252 , -80)
Parity
  0                     0
  1                     229      (132 , 326)
  2-7                   225      (112 , 339)

One outlier affected two estimates
Binary outcome: Logistic regression, Being bullied
MULTIVARIABLE ANALYSIS 2
Ordered categories and model
Interval versus ordered scale:
[Figure: interval scale (1, 2, 3) compared with ordered scale (low, medium, high)]

Categories    Regression model
2             Logistic
3-7           Ordinal logistic
>7            Linear (treat as interval)
Logistic model and assumptions
• Association measure
$OR_1 = e^{\beta_1}$ = odds ratio in y for 1 unit increase in $x_1$
• Assumptions
– Independent errors
– Linear effects on the log odds scale
• Robustness
– influence
Being bullied
• We want the total effect of country on being bullied.
– The risk of being bullied depends on age and sex.
– The age and sex distribution may differ between countries.
• Should we adjust for age and sex?
No, age and sex are mediating variables
[Figure: DAG with exposure E = country, outcome D = bullied, C1 = age, C2 = sex]
Logistic: being bullied
Country     N      %       OR     95% conf. int.
Sweden      407    8.7     1
Iceland     448    10.9    1.3    (0.8 , 2)
Norway      379    16.2    2.0    (1.3 , 3.2)
Finland     409    25.9    3.7    (2.4 , 5.6)
Denmark     436    23.4    3.2    (2.1 , 4.9)
p-value: <0.001

Roughly:
Same risk of being bullied in Iceland as in Sweden.
2 times the risk in Norway as in Sweden.
3 times the risk in Finland as in Sweden.
Prevalence of being bullied = 17%
OR ≈ RR if the outcome is rare
OR > RR (further from 1) if the outcome is common
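The gap between odds ratio and risk ratio can be checked from the percentages in the table. The Python sketch below recomputes crude (unadjusted) ORs and RRs against Sweden; because the model has country as its only covariate, the crude ORs should agree with the table up to rounding.

```python
# Percent bullied per country, taken from the table above.
percent_bullied = {
    "Sweden": 8.7,
    "Iceland": 10.9,
    "Norway": 16.2,
    "Finland": 25.9,
    "Denmark": 23.4,
}

def odds(p):
    return p / (1 - p)

def compare_to_reference(percentages, reference="Sweden"):
    """Return {country: (odds ratio, risk ratio)} relative to the reference."""
    p_ref = percentages[reference] / 100
    result = {}
    for country, pct in percentages.items():
        p = pct / 100
        result[country] = (odds(p) / odds(p_ref), p / p_ref)
    return result

for country, (or_, rr) in compare_to_reference(percent_bullied).items():
    print(f"{country:8s} OR={or_:.1f} RR={rr:.1f}")
```

For Finland the OR is about 3.7 while the RR is about 3.0: with an outcome this common, the OR lies further from 1 than the RR.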
Summing up
• DAGs
– State prior knowledge. Guide analysis
• Plots
– Linearity, variance, outliers
• Bivariate analysis
– Continuous symmetrical: Mean, T-test, ANOVA
– Continuous skewed: Median, nonparametric
– Categorical: Freq, cross, chi-square
• Multivariable analysis
– Continuous: Linear regression
– Binary: Logistic regression