Stata 3, Regression
Hein Stigum
Presentation, data and programs at:
http://folk.uio.no/heins/
Apr-20
H.S.
1
Agenda
• Linear regression
• GLM
• Logistic regression
• Binary regression
• (Conditional logistic)
10 April 2020
H.S.
2
Linear regression
Birth weight
by
gestational age
10 April 2020
H.S.
3
Regression idea
5000
model : y  b0  b1 x  e
4500
y = outcome
4000
e  error, residual
2500
3000
b1  coefficien t , effect of x
3500
x = covariate
240
260
280
gestational age (days)
300
320
model with many cofactors : y  b0  b1 x1  b2 x2  e
x1 , x 2 = covariate
10 April 2020
H.S.
4
Model and assumptions
• Model
y   0  1 x1   2 x2   ,   N (0,  2 )
• Assumptions
– Independent errors
– Linear effects
– Constant error variance
10 April 2020
H.S.
5
Association measure: RD
Model:
Start with:
y  β0  β1 x1  β2 x2
RD1  y x1  2  y x1 1
 β0  2 β1  β2 x2
 β0  1β1  β2 x2

Hence:
10 April 2020
β1
RD1  β1
H.S.
6
Purpose of regression
• Estimation
– Estimate association between outcome and
exposure adjusted for other covariates
• Prediction
– Use an estimated model to predict the
outcome given covariates in a new dataset
10 April 2020
H.S.
7
Adjusting for confounders
True value
True value
Unadjusted estimate
Adjusted estimate
• Not adjust
– Cofactor is a collider
– Cofactor is in causal path
• May or may not adjust
– Cofactor has missing
– Cofactor has error
10 April 2020
H.S.
8
Workflow
• Scatterplots
• Bivariate analysis
• Regression
– Model fitting
• Cofactors in/out
• Interactions
– Test of assumptions
• Independent errors
• Linear effects
• Constant error variance
– Influence (robustness)
10 April 2020
H.S.
9
4000
3000
2000
weight
5000
6000
Scatterplot
200
300
400
500
600
700
gest
10 April 2020
H.S.
10
Syntax
• Estimation
– regress y x1 x2
– xi: regress y x1 i.c1
linear regression
categorical c1
• Post estimation
– predict yf, xb
predict
• Manage models
– estimates store m1
10 April 2020
save model
H.S.
11
Model 1: outcome+exposure
10 April 2020
H.S.
12
Model 2: Add counfounders
Estimate association:
m1=m2
Prediction: m2 is best
10 April 2020
H.S.
13
”Dummies”
Assume educ is coded 1, 2, 3 for low, medium and high education
Choose low educ as reference
Make dummies for the two other categories:
generate medium=(educ==2) if educ<.
generate high =(educ==3) if educ<.
10 April 2020
H.S.
14
Interaction
Model:
Start with:
y  β0  β1 x1  β2 x2  β3 x1sex
RD1  y x1 2  y x1 1
 β0  2 β1  β2 x2  β3 2 sex
 β0  1 β1  β2 x2  β31sex

Hence:
10 April 2020
β1  β3sex
RD1  β1  β3sex
H.S.
15
Model 3: with interaction
10 April 2020
H.S.
16
Test of assumptions
• Predict y and residuals
– predict y, xb
– predict res, resid
2000
0
-1000
– independent?
– linear?
– const. var?
1000
• Plot resid vs y
twoway (scatter res y )(qfitci res y)
3200
3400
3600
3800
Linear prediction
10 April 2020
H.S.
17
Violations of assumptions
• Dependent residuals
.5
1
Mixed models: xtmixed
-1
200
220
240 260
gest
280
300
0
-1
-2
regress weigth gest sex, robust
res
• Non-constant variance
1
2
gen gest2=gest^2
regress weigth gest gest2 sex
-.5
0
• Non linear effects
3400
10 April 2020
H.S.
3500
3600
p
3700
18
3800
.2
Measures of influence
-.6
-.4
-.2
0
Remove obs 1, see change
remove obs 2, see change
1
2
10
Id
• Measure change in:
– Outcome (y)
– Deviance
– Coefficients (beta)
• Delta beta, Cook’s distance
10 April 2020
H.S.
19
Points with high influence
.8
lvr2plot, mlabel(id)
.4
.2
Leverage
.6
539
0
505
291
134
564
270
500
206
110
79
180
366
35
226
152
105
543
512
211
582
418
96
125
181
256
161
404
49
444
74
78
139
252
19
565
5381
136
56
61
191
176
82
36
106
313
137
269
274
399
480
545
576
33
85
81
254
400
571
70
103
214215
457
498
185
372374
91
112
581
353
217218
493
456
45
167
6
89
220
58
266
492
286
344345
440
38
389
540
485
327
311
190
289
111
439
212
209
437
251
526
95
75
88
476477
359
221
238
165
315
537
279
336
532
200
455
64
173
119
50
60
355
113
201
232
340
86
488
467
9
1112
31
115
129
177
267
323
338
349
351
354
357
363
414
413
448
453
466
483
486
533
538
548
552
1
30
48
66
65
93
153
229
231
253
277
292
416
426
490
7
14
25
32
205
230
241
303
396
464
481
497
546
43
68
99
116
138
249250
375
102
247
296
339
523524
580
126
144
195
304
365
410
420
482
503
235
244
278
284
364
513
568
573
72
98
178
316
403
452
69
151
237
306
446
451
489
42
118
198
210
219
394
409
407408
434
563
175
196
199
574
427
1
168
222
318
406
423
460
275
307
479
502
536
8384
124
154
294
239
295
319
397
401
501
557
202
0
5
130
142
128
108
390
432
511
59
334
422
465
87
109
171
575
157
322
324
424
186
463
527
101
301
384
478
560
13
149
90
290
367368
37
193
312
461
18
160
328
395
441
148
4
431
555
57
2
515
39
172
76
122
203
551
299
326
16
510
562
121
487
554
40
184
496
187
443
141
305
51
143
150
335
468
5
54
281
583
521
570
19
470
27
91
276
56
97
293
182
347
379
225
42
376
348
44
346
24
236
495
530
183
520
23
509
216
127
155
174
246
383
471
224
120
29
158
194
517
20
145
547
385
261
577
314
341
26
107
320
300
156
329
370
522
475
425
259
27
255
569
421
360
528
358
474
298
350
499
3264
94
388
146
525
321
179
192
77
21
131
43
41
100
213
535
170
449
472
393
387
248
273
325
208
484
132
436
519
331
412
572
234
164
262
80
429
378
391
163
438
245
283
405
447
529
258
271
135
28
332
508
553
263
386
310
454
228
189
223
430
514
133
566
104
541
4
462
242
402
3
240
579
369
1
97
380
337
371
428
567
352
188
280
559
361
504
330
516
435
507
506
169
233
469
392
287
62288
0
10 April 2020
445
285
.01
.02
Normalized residual squared
H.S.
.03
20
Added variable plot: gestational age
2000
avplot gest, mlabel(id)
285
-1000
0
1000
445
288
28467
62
287
392
113
405
529
106
258
60
233
50
438
169
119
232
378
173
429
163
164
262
506
234
532
519
484
412
537
325
516
213
165
504
21
132
77
192
525
170
298
526
343
535
200
388
179
528
475
341
156
320
27
61
499
358
559
547
517
246
383
556
359
75
259
261
352
577
522
329
23
495
183
370
155
512
29
44
139
371
225
509
524
78
127
212
224
520
348
190
327
293
311
380
337
250
554
408
143
440
404
256
218
347
281
203
286
110
136
468
172
470
197
562
510
16
266
209
542
335
431
369
328
57
148
305
367
289
101
167
240
395
3491
227
555
157
87
390
130
37
424
540
38
461
90
551
402
154
149
560
401
422
501
406
581
1295
108
432
372
133
83
124
185
511
104
430
307
451
171
566
514
446
151
418
409
98
460
219
237
400
423
126
6 427
196
199
394
178
247
318
489
284
365
571
118
195
254
407
198
304
503
138
396
206
420
226
278
241
230
30
137
274
582
235
153
231
253
546
399
545
10
453
99
105316
448
482
339
296
7
269
454
349
129
249
14
32
263
386
414
303
357
533
490
66
277
31
332
229
576
85
292
416
413
115
177
508
323
548
363
481
267
354
483
25
552
466
553
338
486
538
351
9
214
457
498
480
11
102
426
310
410
112
144
48
93
68
228
513
189
244
364
568
563
65
375
434
42
175
91
116
43
223
306
69
217
464
205
72
33
497
574
573
81
523
452
580
397
239
222
202
505
465
103
479
403
294
319
334
210
128
59
275
536
70
168
541
186
527
557
152
502
322
96
462
125
142
4
242
384
353
109
575
13
193
5
389
515
76
181
493
299
463
312
441
39
324
456
45
290
184
141
150
487
220
301
18
12
122
187
439
478
419
276
326
89
579
160
56
437
543
58
443
570
492
374
215
381
40
344
84
583
297
121
182
251
51
54
161
496
236
24
49
521
444
485
379
111
376
270565
471
74
314
252
19
530
20
120
346
216
428
300
567
26
569
188
255
174
194
421
88
368
158
280
385
345
107
350
145
191
425
264
131
35
279
336
82
94
41
361
472
360
176
211
321
474
449
146
436
100
500
238
393
95
221
330
476
315
387
248
79
477
435
366 273
208
572
331
507
391
455
6480
36
180
245
283
469
291
355
447
201
488
86
271
313
134 340
135
564
-100
0
539
100
200
e( gest | X )
300
400
coef = 5.6762662, se = 1.1303444, t = 5.02
10 April 2020
H.S.
21
Removing outlier
10 April 2020
H.S.
22
6000
Influence
5000
Regression
without outlier
4000
Regression with outlier
2000
3000
Outlier
200
10 April 2020
300
400
500
Gestational age
H.S.
600
700
23
Final model
y   0  1 x1   2 x2   ,   N (0,  2 )
Give meaning to constant term:
sum gest
generate gest2=gest-204
generate sex2=sex-1
regress weight gest2 sex2
estimates store m4
10 April 2020
/* find smallest value */
/* smallest gest=204 */
/* boys=0, girls=1 */
/* final model */
H.S.
24
Logistic regression
Being bullied
10 April 2020
H.S.
25
Model and assumptions
• Model

ln(
)   0  1 x1   2 x2 ,   E ( y | x ), y | x ~ Binomial
1 
• Assumptions
– Independent residuals
– Linear effects
10 April 2020
H.S.
26
Association measure, Odds ratio
Model:
Start with:
  
  β0  β1 x1  β2 x2
ln 
 1- 
 Odds x1  2 

ln (OR1 )  ln 
 Odds x 1 
1


ln (Odds x1  2 )  ln (Odds x1 1 )
 β0  2 β1  β2 x2
 β0  1β1  β2 x2

Hence:
10 April 2020
β1
OR1  exp (β1 )
H.S.
27
Syntax
• Estimation
– logistic y x1 x2
– xi: logistic y x1 i.c1
logistic regression
categorical c1
• Post estimation
– predict yf, pr
predict probability
• Manage models
– estimates store m1
– est table m1, eform
10 April 2020
save model
show OR
H.S.
28
Workflow
• Bivariate analysis
• Regression
– Model fitting
• Cofactors in/out
• Interactions
– Test of assumptions
• Independent errors
• Linear effects
– Influence (robustness)
10 April 2020
H.S.
29
Bivariate
Generate dummies
gen Island=
gen Norway=
gen Finland=
gen Denmark=
10 April 2020
(country==2) if country<.
(country==3)
(country==4)
(country==5)
H.S.
30
Model 1: outcome and exposure
Alternative commands:
xi:logistic bullied i.country
use xi: i.var for categorical variables
xi:logistic bullied i.country , coef
coefs instead of OR's
xi:logistic bullied i.country if sex!=. & age!=. do if sex and age not missing
10 April 2020
H.S.
31
Model 2: Add confounders
Estimate associations: m1=m2
Predict:
m2 best
10 April 2020
H.S.
32
Interaction
Model:
Start with:
  
  β0  β1 x1  β2 sex  β3 x1sex
ln 
 1- 
 Odds x1  2 

ln (OR1 )  ln 
 Odds x 1 
1


ln (Odds x1  2 )  ln (Odds x1 1 )
 β0  2 β1  β2 sex  β3 2sex
 β0  1β1  β2 sex  β31sex

Hence:
10 April 2020
β1  β3 sex
OR1  exp (β1  β3 sex)
H.S.
33
Model 3: interaction
10 April 2020
H.S.
34
Test of assumptions
• Linear effects (of age)
.4
0
.2
coefficient
.6
.8
– findit lincheck
search and install
– lincheck xi:logistic bullied age I.country sex
5
10 April 2020
10
age
H.S.
15
35
Points with high influence
restore best model
probability (mu in our notation)
delta-beta (one value, not one per estimate)
delta-beta plot
0
.2
.4
Pregibon's dbeta
.6
.8
estimates restore m2
predict p, p
predict db, db
scatter db p
.05
10 April 2020
.1
.15
.2
Pr(bullied)
H.S.
.25
.3
36
Removing 2 observations
Conclusion:
Robust results
10 April 2020
H.S.
37
Generalized Linear Models
Being bullied
10 April 2020
H.S.
38
Designs and measures
Natural association
measures
Design
Frequency measures
Cross-sectional
Prevalence
Cohort
Incidence risk
Incidence rate
Case Control
Traditional CC
Case Cohort
Nested Case Control
Risk Ratio, Risk Difference
Risk Ratio, Risk Difference
Rate Ratio
Odds Ratio
Rate Ratio
Rate Ratio
Models
GLM
Survival
10 April 2020
Measures
RR, RD, OR
Rate Ratio
H.S.
39
2500 3000 3500 4000 4500 5000
Generalized Linear Models, GLM
Linear regression
250
270
290
310
.6
.8
1
gestational age (days)
0
.2
.4
risk
Logistic regression
20
40
age
60
80
10
15
0
0
5
Poisson regression
0
20
Apr-20
40
60
80
H.S.
40
GLM: Distribution and link
• Distribution family
– Given by data
– Influence p-value, CI
Normal
Binomial
Poisson
– May chose
– Shape (=link-1)
– Scale
Identity
Logit
Log
Additive
Multi.
Multi.
– Association measure
RD
OR
RR
• Link function
Apr-20
H.S.
41
Distribution and link examples
Data
Distribution
Standard models
Continuous Normal
Count
Poisson
0/1
Binomial
Other models
0/1
Binomial
0/1
Binomial
Link
Measure
Name
Identity
Log
Logit
RR
OR
Linear
Poisson
Logistic
Log
Identity
RR
RD
Log binomial
Linear binomial
OBS: not for traditional case control data
Link: Identity  linear model  additive scale
Apr-20
H.S.
42
Being bullied, 3 models
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(logit)
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(log)
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(identity)
Apr-20
H.S.
43
Convergence problems
Stop
• If glm does not converge, use:
– poisson y x1 x2, irr robust
– regress y x1 x2,
robust
10 April 2020
H.S.
RR
RD
44
Association measure, RR
Model:
Start with:
ln    β0  β1 x1  β2 x2  ...
  x1 2 

ln (RR1 )  ln 
  x 1 
 1 
ln ( x1 2 )  ln ( x1 1 )
 β0  2 β1  β2 x2  ...
 β0  1β1  β2 x2  ...

Hence:
10 April 2020
β1
RR1  exp (β1 )
H.S.
45
Association measure: RD
Model:
Start with:
  β0  β1 x1  β2 x2  ...
RD1   x1 2   x1 1
 β0  2 β1  β2 x2  ...
 β0  1 β1  β2 x2  ...

Hence:
10 April 2020
β1
RD1  β1
H.S.
46
30
The importance of scale
20
Females
Multiplicative scale
Absolute increase
Relative increase
Females: 30-20=10
Males: 20-10=10
Females: 30/20=1.5
Males: 20/10=2.0
Conclusion:
Same increase for
males and females
Conclusion:
More increase for
males
RD
RR
0
10
Males
Additive scale
T1
10 April 2020
T2
H.S.
47
Stata regression commands
10 April 2020
H.S.
55
• Regression with simple error structure
– regress linear regression (also heteroschedastic errors)
– nl
non linear least squares
• GLM
– logistic logistic regression
– poisson Poisson regression
– binreg binary outcome, OR, RR, or RD effect measures
• Conditional logistc
– clogit
for matched case-control data
• Multiple outcome
– mlogit
– ologit
multinomial logit (not ordered)
ordered logit
• Regression with complex error structure
– xtmixed linear mixed models
– xtlogit random effect logistic
10 April 2020
H.S.
56
• Estimation
Syntax
– regress y x1 x2
– logistic y x1 x2
– xi:regress y x1 i.x2
linear regression
logistic regression
categorical x2
• Manage results
– estimates store m1
– estimates table m1 m2
– estimates stats m1 m2
store results
table of results
statistics of results
• Post estimation
– predict y, xb
– predict res, resid
– lincom b0+2*b3
linear prediction
residuals
linear combination
• Help
– help logistic postestimation
10 April 2020
H.S.
57