Stata 3, Regression Hein Stigum Presentation, data and programs at: http://folk.uio.no/heins/ Apr-20 H.S. 1 Agenda • Linear regression • GLM • Logistic regression • Binary regression • (Conditional logistic) 10 April 2020 H.S. 2 Linear regression Birth weight by gestational age 10 April 2020 H.S. 3 Regression idea 5000 model : y b0 b1 x e 4500 y = outcome 4000 e error, residual 2500 3000 b1 coefficien t , effect of x 3500 x = covariate 240 260 280 gestational age (days) 300 320 model with many cofactors : y b0 b1 x1 b2 x2 e x1 , x 2 = covariate 10 April 2020 H.S. 4 Model and assumptions • Model y 0 1 x1 2 x2 , N (0, 2 ) • Assumptions – Independent errors – Linear effects – Constant error variance 10 April 2020 H.S. 5 Association measure: RD Model: Start with: y β0 β1 x1 β2 x2 RD1 y x1 2 y x1 1 β0 2 β1 β2 x2 β0 1β1 β2 x2 Hence: 10 April 2020 β1 RD1 β1 H.S. 6 Purpose of regression • Estimation – Estimate association between outcome and exposure adjusted for other covariates • Prediction – Use an estimated model to predict the outcome given covariates in a new dataset 10 April 2020 H.S. 7 Adjusting for confounders True value True value Unadjusted estimate Adjusted estimate • Not adjust – Cofactor is a collider – Cofactor is in causal path • May or may not adjust – Cofactor has missing – Cofactor has error 10 April 2020 H.S. 8 Workflow • Scatterplots • Bivariate analysis • Regression – Model fitting • Cofactors in/out • Interactions – Test of assumptions • Independent errors • Linear effects • Constant error variance – Influence (robustness) 10 April 2020 H.S. 9 4000 3000 2000 weight 5000 6000 Scatterplot 200 300 400 500 600 700 gest 10 April 2020 H.S. 10 Syntax • Estimation – regress y x1 x2 – xi: regress y x1 i.c1 linear regression categorical c1 • Post estimation – predict yf, xb predict • Manage models – estimates store m1 10 April 2020 save model H.S. 11 Model 1: outcome+exposure 10 April 2020 H.S. 12 Model 2: Add counfounders Estimate association: m1=m2 Prediction: m2 is best 10 April 2020 H.S. 13 ”Dummies” Assume educ is coded 1, 2, 3 for low, medium and high education Choose low educ as reference Make dummies for the two other categories: generate medium=(educ==2) if educ<. generate high =(educ==3) if educ<. 10 April 2020 H.S. 14 Interaction Model: Start with: y β0 β1 x1 β2 x2 β3 x1sex RD1 y x1 2 y x1 1 β0 2 β1 β2 x2 β3 2 sex β0 1 β1 β2 x2 β31sex Hence: 10 April 2020 β1 β3sex RD1 β1 β3sex H.S. 15 Model 3: with interaction 10 April 2020 H.S. 16 Test of assumptions • Predict y and residuals – predict y, xb – predict res, resid 2000 0 -1000 – independent? – linear? – const. var? 1000 • Plot resid vs y twoway (scatter res y )(qfitci res y) 3200 3400 3600 3800 Linear prediction 10 April 2020 H.S. 17 Violations of assumptions • Dependent residuals .5 1 Mixed models: xtmixed -1 200 220 240 260 gest 280 300 0 -1 -2 regress weigth gest sex, robust res • Non-constant variance 1 2 gen gest2=gest^2 regress weigth gest gest2 sex -.5 0 • Non linear effects 3400 10 April 2020 H.S. 3500 3600 p 3700 18 3800 .2 Measures of influence -.6 -.4 -.2 0 Remove obs 1, see change remove obs 2, see change 1 2 10 Id • Measure change in: – Outcome (y) – Deviance – Coefficients (beta) • Delta beta, Cook’s distance 10 April 2020 H.S. 19 Points with high influence .8 lvr2plot, mlabel(id) .4 .2 Leverage .6 539 0 505 291 134 564 270 500 206 110 79 180 366 35 226 152 105 543 512 211 582 418 96 125 181 256 161 404 49 444 74 78 139 252 19 565 5381 136 56 61 191 176 82 36 106 313 137 269 274 399 480 545 576 33 85 81 254 400 571 70 103 214215 457 498 185 372374 91 112 581 353 217218 493 456 45 167 6 89 220 58 266 492 286 344345 440 38 389 540 485 327 311 190 289 111 439 212 209 437 251 526 95 75 88 476477 359 221 238 165 315 537 279 336 532 200 455 64 173 119 50 60 355 113 201 232 340 86 488 467 9 1112 31 115 129 177 267 323 338 349 351 354 357 363 414 413 448 453 466 483 486 533 538 548 552 1 30 48 66 65 93 153 229 231 253 277 292 416 426 490 7 14 25 32 205 230 241 303 396 464 481 497 546 43 68 99 116 138 249250 375 102 247 296 339 523524 580 126 144 195 304 365 410 420 482 503 235 244 278 284 364 513 568 573 72 98 178 316 403 452 69 151 237 306 446 451 489 42 118 198 210 219 394 409 407408 434 563 175 196 199 574 427 1 168 222 318 406 423 460 275 307 479 502 536 8384 124 154 294 239 295 319 397 401 501 557 202 0 5 130 142 128 108 390 432 511 59 334 422 465 87 109 171 575 157 322 324 424 186 463 527 101 301 384 478 560 13 149 90 290 367368 37 193 312 461 18 160 328 395 441 148 4 431 555 57 2 515 39 172 76 122 203 551 299 326 16 510 562 121 487 554 40 184 496 187 443 141 305 51 143 150 335 468 5 54 281 583 521 570 19 470 27 91 276 56 97 293 182 347 379 225 42 376 348 44 346 24 236 495 530 183 520 23 509 216 127 155 174 246 383 471 224 120 29 158 194 517 20 145 547 385 261 577 314 341 26 107 320 300 156 329 370 522 475 425 259 27 255 569 421 360 528 358 474 298 350 499 3264 94 388 146 525 321 179 192 77 21 131 43 41 100 213 535 170 449 472 393 387 248 273 325 208 484 132 436 519 331 412 572 234 164 262 80 429 378 391 163 438 245 283 405 447 529 258 271 135 28 332 508 553 263 386 310 454 228 189 223 430 514 133 566 104 541 4 462 242 402 3 240 579 369 1 97 380 337 371 428 567 352 188 280 559 361 504 330 516 435 507 506 169 233 469 392 287 62288 0 10 April 2020 445 285 .01 .02 Normalized residual squared H.S. .03 20 Added variable plot: gestational age 2000 avplot gest, mlabel(id) 285 -1000 0 1000 445 288 28467 62 287 392 113 405 529 106 258 60 233 50 438 169 119 232 378 173 429 163 164 262 506 234 532 519 484 412 537 325 516 213 165 504 21 132 77 192 525 170 298 526 343 535 200 388 179 528 475 341 156 320 27 61 499 358 559 547 517 246 383 556 359 75 259 261 352 577 522 329 23 495 183 370 155 512 29 44 139 371 225 509 524 78 127 212 224 520 348 190 327 293 311 380 337 250 554 408 143 440 404 256 218 347 281 203 286 110 136 468 172 470 197 562 510 16 266 209 542 335 431 369 328 57 148 305 367 289 101 167 240 395 3491 227 555 157 87 390 130 37 424 540 38 461 90 551 402 154 149 560 401 422 501 406 581 1295 108 432 372 133 83 124 185 511 104 430 307 451 171 566 514 446 151 418 409 98 460 219 237 400 423 126 6 427 196 199 394 178 247 318 489 284 365 571 118 195 254 407 198 304 503 138 396 206 420 226 278 241 230 30 137 274 582 235 153 231 253 546 399 545 10 453 99 105316 448 482 339 296 7 269 454 349 129 249 14 32 263 386 414 303 357 533 490 66 277 31 332 229 576 85 292 416 413 115 177 508 323 548 363 481 267 354 483 25 552 466 553 338 486 538 351 9 214 457 498 480 11 102 426 310 410 112 144 48 93 68 228 513 189 244 364 568 563 65 375 434 42 175 91 116 43 223 306 69 217 464 205 72 33 497 574 573 81 523 452 580 397 239 222 202 505 465 103 479 403 294 319 334 210 128 59 275 536 70 168 541 186 527 557 152 502 322 96 462 125 142 4 242 384 353 109 575 13 193 5 389 515 76 181 493 299 463 312 441 39 324 456 45 290 184 141 150 487 220 301 18 12 122 187 439 478 419 276 326 89 579 160 56 437 543 58 443 570 492 374 215 381 40 344 84 583 297 121 182 251 51 54 161 496 236 24 49 521 444 485 379 111 376 270565 471 74 314 252 19 530 20 120 346 216 428 300 567 26 569 188 255 174 194 421 88 368 158 280 385 345 107 350 145 191 425 264 131 35 279 336 82 94 41 361 472 360 176 211 321 474 449 146 436 100 500 238 393 95 221 330 476 315 387 248 79 477 435 366 273 208 572 331 507 391 455 6480 36 180 245 283 469 291 355 447 201 488 86 271 313 134 340 135 564 -100 0 539 100 200 e( gest | X ) 300 400 coef = 5.6762662, se = 1.1303444, t = 5.02 10 April 2020 H.S. 21 Removing outlier 10 April 2020 H.S. 22 6000 Influence 5000 Regression without outlier 4000 Regression with outlier 2000 3000 Outlier 200 10 April 2020 300 400 500 Gestational age H.S. 600 700 23 Final model y 0 1 x1 2 x2 , N (0, 2 ) Give meaning to constant term: sum gest generate gest2=gest-204 generate sex2=sex-1 regress weight gest2 sex2 estimates store m4 10 April 2020 /* find smallest value */ /* smallest gest=204 */ /* boys=0, girls=1 */ /* final model */ H.S. 24 Logistic regression Being bullied 10 April 2020 H.S. 25 Model and assumptions • Model ln( ) 0 1 x1 2 x2 , E ( y | x ), y | x ~ Binomial 1 • Assumptions – Independent residuals – Linear effects 10 April 2020 H.S. 26 Association measure, Odds ratio Model: Start with: β0 β1 x1 β2 x2 ln 1- Odds x1 2 ln (OR1 ) ln Odds x 1 1 ln (Odds x1 2 ) ln (Odds x1 1 ) β0 2 β1 β2 x2 β0 1β1 β2 x2 Hence: 10 April 2020 β1 OR1 exp (β1 ) H.S. 27 Syntax • Estimation – logistic y x1 x2 – xi: logistic y x1 i.c1 logistic regression categorical c1 • Post estimation – predict yf, pr predict probability • Manage models – estimates store m1 – est table m1, eform 10 April 2020 save model show OR H.S. 28 Workflow • Bivariate analysis • Regression – Model fitting • Cofactors in/out • Interactions – Test of assumptions • Independent errors • Linear effects – Influence (robustness) 10 April 2020 H.S. 29 Bivariate Generate dummies gen Island= gen Norway= gen Finland= gen Denmark= 10 April 2020 (country==2) if country<. (country==3) (country==4) (country==5) H.S. 30 Model 1: outcome and exposure Alternative commands: xi:logistic bullied i.country use xi: i.var for categorical variables xi:logistic bullied i.country , coef coefs instead of OR's xi:logistic bullied i.country if sex!=. & age!=. do if sex and age not missing 10 April 2020 H.S. 31 Model 2: Add confounders Estimate associations: m1=m2 Predict: m2 best 10 April 2020 H.S. 32 Interaction Model: Start with: β0 β1 x1 β2 sex β3 x1sex ln 1- Odds x1 2 ln (OR1 ) ln Odds x 1 1 ln (Odds x1 2 ) ln (Odds x1 1 ) β0 2 β1 β2 sex β3 2sex β0 1β1 β2 sex β31sex Hence: 10 April 2020 β1 β3 sex OR1 exp (β1 β3 sex) H.S. 33 Model 3: interaction 10 April 2020 H.S. 34 Test of assumptions • Linear effects (of age) .4 0 .2 coefficient .6 .8 – findit lincheck search and install – lincheck xi:logistic bullied age I.country sex 5 10 April 2020 10 age H.S. 15 35 Points with high influence restore best model probability (mu in our notation) delta-beta (one value, not one per estimate) delta-beta plot 0 .2 .4 Pregibon's dbeta .6 .8 estimates restore m2 predict p, p predict db, db scatter db p .05 10 April 2020 .1 .15 .2 Pr(bullied) H.S. .25 .3 36 Removing 2 observations Conclusion: Robust results 10 April 2020 H.S. 37 Generalized Linear Models Being bullied 10 April 2020 H.S. 38 Designs and measures Natural association measures Design Frequency measures Cross-sectional Prevalence Cohort Incidence risk Incidence rate Case Control Traditional CC Case Cohort Nested Case Control Risk Ratio, Risk Difference Risk Ratio, Risk Difference Rate Ratio Odds Ratio Rate Ratio Rate Ratio Models GLM Survival 10 April 2020 Measures RR, RD, OR Rate Ratio H.S. 39 2500 3000 3500 4000 4500 5000 Generalized Linear Models, GLM Linear regression 250 270 290 310 .6 .8 1 gestational age (days) 0 .2 .4 risk Logistic regression 20 40 age 60 80 10 15 0 0 5 Poisson regression 0 20 Apr-20 40 60 80 H.S. 40 GLM: Distribution and link • Distribution family – Given by data – Influence p-value, CI Normal Binomial Poisson – May chose – Shape (=link-1) – Scale Identity Logit Log Additive Multi. Multi. – Association measure RD OR RR • Link function Apr-20 H.S. 41 Distribution and link examples Data Distribution Standard models Continuous Normal Count Poisson 0/1 Binomial Other models 0/1 Binomial 0/1 Binomial Link Measure Name Identity Log Logit RR OR Linear Poisson Logistic Log Identity RR RD Log binomial Linear binomial OBS: not for traditional case control data Link: Identity linear model additive scale Apr-20 H.S. 42 Being bullied, 3 models glm bullied Island Norway Finland Denmark sex age, family(binomial) link(logit) glm bullied Island Norway Finland Denmark sex age, family(binomial) link(log) glm bullied Island Norway Finland Denmark sex age, family(binomial) link(identity) Apr-20 H.S. 43 Convergence problems Stop • If glm does not converge, use: – poisson y x1 x2, irr robust – regress y x1 x2, robust 10 April 2020 H.S. RR RD 44 Association measure, RR Model: Start with: ln β0 β1 x1 β2 x2 ... x1 2 ln (RR1 ) ln x 1 1 ln ( x1 2 ) ln ( x1 1 ) β0 2 β1 β2 x2 ... β0 1β1 β2 x2 ... Hence: 10 April 2020 β1 RR1 exp (β1 ) H.S. 45 Association measure: RD Model: Start with: β0 β1 x1 β2 x2 ... RD1 x1 2 x1 1 β0 2 β1 β2 x2 ... β0 1 β1 β2 x2 ... Hence: 10 April 2020 β1 RD1 β1 H.S. 46 30 The importance of scale 20 Females Multiplicative scale Absolute increase Relative increase Females: 30-20=10 Males: 20-10=10 Females: 30/20=1.5 Males: 20/10=2.0 Conclusion: Same increase for males and females Conclusion: More increase for males RD RR 0 10 Males Additive scale T1 10 April 2020 T2 H.S. 47 Stata regression commands 10 April 2020 H.S. 55 • Regression with simple error structure – regress linear regression (also heteroschedastic errors) – nl non linear least squares • GLM – logistic logistic regression – poisson Poisson regression – binreg binary outcome, OR, RR, or RD effect measures • Conditional logistc – clogit for matched case-control data • Multiple outcome – mlogit – ologit multinomial logit (not ordered) ordered logit • Regression with complex error structure – xtmixed linear mixed models – xtlogit random effect logistic 10 April 2020 H.S. 56 • Estimation Syntax – regress y x1 x2 – logistic y x1 x2 – xi:regress y x1 i.x2 linear regression logistic regression categorical x2 • Manage results – estimates store m1 – estimates table m1 m2 – estimates stats m1 m2 store results table of results statistics of results • Post estimation – predict y, xb – predict res, resid – lincom b0+2*b3 linear prediction residuals linear combination • Help – help logistic postestimation 10 April 2020 H.S. 57