BSTA 670 – Statistical Computing
27 October 2010
Lecture 14: Random Number Generation
Presented by: Paul Wileyto, Ph.D.

[Figure: a good analog random number generator.]

"Anyone who uses software to produce random numbers is in a state of sin."
John von Neumann

Why do we need Random Numbers?
• Simulation Input
  – Statistical Sampling
• Assignment in Trials
• Games

Where do you get random numbers?
• Uniform Random Numbers
  – Published Tables
  – Make Them
    • Computer Algorithms
    • Harvest from Nature
• Random draws from a specific distribution
  – Make them from Uniform Random Numbers

Software Random Number Generators
• There are no true random number generators, but
• There are Pseudo-Random Number Generators
  – Computers have only a limited number of bits to represent a number
  – Sooner or later, the sequence of random numbers will repeat itself (the period of the generator)
  – The trick is to be “good enough” to look like random numbers

Algorithms for Uniform Random Numbers
Good pseudo-random numbers:
• Independent of the previous number
• Long period
• Sequence reproducible if started with the same initial conditions
• Fast
• Equal probability for any number inside the interval [a,b]

Probability Density:
\[ f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases} \]

We are interested primarily in uniform random numbers in the interval [0,1].
• We’ll refer to the realization of a uniform random number over [0,1] as U.
• Many of the algorithms produce integer-valued random numbers over the interval [0,b].
  – Transform to the interval [0,1]

Linear Congruential Generator (LCG): Mod in SAS
Most common:
\[ X_n = (\alpha X_{n-1} + c) \bmod m \]
X_0 = seed, modulus m (large prime), multiplier \alpha, and increment c.
The sequence repeats because the modular arithmetic forces values to wrap into the desired range.
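The recurrence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator used by any particular package: the function name `lcg` is hypothetical, the tiny parameters (a = 5, c = 3, m = 16) are chosen just to make the wrap-around visible, and the larger parameters (a = 16807, c = 0, m = 2^31 − 1) are the "magic" values quoted later in this lecture.

```python
from itertools import islice

def lcg(seed, a, c, m):
    """Yield X_n = (a * X_{n-1} + c) mod m, starting with the value after the seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# Tiny modulus so the wrap-around is visible: the sequence has to repeat
# after at most m = 16 values.
small = list(islice(lcg(seed=1, a=5, c=3, m=16), 20))
print(small)

# Larger parameters, rescaled from integers to the interval (0, 1).
m = 2**31 - 1
u = [x / m for x in islice(lcg(seed=123456, a=16807, c=0, m=m), 5)]
print(u)
```

With the tiny modulus, the 17th value equals the 1st, which is the period of the generator showing itself.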
SAS:
    proc iml; /* begin IML session */
    q={20,30,40,50,70,90,160};
    t=mod(q,7);
    qt=q||t;
    print qt; /* print matrix */
    quit;

SAS output:
    qt
     20   6
     30   2
     40   5
     50   1
     70   0
     90   6
    160   6

Linear Congruential Generator (LCG): Mod in R

R:
    q<-matrix(seq(10,100, by=10),10,1)
    qm=q%%13
    qall<-cbind(q,qm)
    qall
          [,1] [,2]
     [1,]   10   10
     [2,]   20    7
     [3,]   30    4
     [4,]   40    1
     [5,]   50   11
     [6,]   60    8
     [7,]   70    5
     [8,]   80    2
     [9,]   90   12
    [10,]  100    9

Unit Random Variates in SAS

SAS:
    proc iml; /* begin IML session */
    seed = 123456;
    c = j(5,1,seed);
    b = uniform(c);
    print b;
    quit;

SAS output:
    b
    0.73902
    0.2724794
    0.7095326
    0.3191636
    0.367853

Unit Random Variates in R

R:
    RNGkind()
    [1] "Mersenne-Twister" "Inversion"
    set.seed(as.integer(format(Sys.time(), "%S%M%H")))
    c<-matrix(runif(5),5,1)
    c
              [,1]
    [1,] 0.9919911
    [2,] 0.2598466
    [3,] 0.1818524
    [4,] 0.3357782
    [5,] 0.2754353

Unit Random Variates in SAS
Set seed to 0 to grab a seed value from the system clock.

SAS:
    proc iml; /* begin IML session */
    seed = 0;
    c = j(5,1,seed);
    b = uniform(c);
    print b;
    quit;

SAS output:
    b
    0.73902
    0.2724794
    0.7095326
    0.3191636
    0.367853

SAS RANUNI() and IML UNIFORM() use a multiplicative linear congruential generator (from the SAS docs), where
    SEED = mod( SEED * 397204094, 2**31-1 )
and the function then returns SEED / (2**31-1).

Testing Randomness
• Is it Uniform?
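One way to put a number on "Is it uniform?" is to bin the draws and compare the counts with a flat expectation. The sketch below is an assumption-laden illustration in Python: `ranuni_like` is a hypothetical helper that reuses the multiplier 397204094 and modulus 2**31 − 1 quoted above (it is not SAS's actual implementation), and `chi_square_uniform` returns only the Pearson statistic, leaving the comparison against a chi-square table to the reader.

```python
def ranuni_like(seed, n):
    """Return n values in (0, 1) from SEED = SEED * 397204094 mod (2**31 - 1)."""
    m = 2**31 - 1
    out = []
    for _ in range(n):
        seed = (seed * 397204094) % m
        out.append(seed / m)
    return out

def chi_square_uniform(u, bins=10):
    """Bin counts and Pearson chi-square statistic for equal-width bins on [0, 1)."""
    counts = [0] * bins
    for value in u:
        counts[min(int(value * bins), bins - 1)] += 1
    expected = len(u) / bins
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return counts, stat

u = ranuni_like(seed=123456, n=10000)
counts, stat = chi_square_uniform(u)
print(counts)
print(round(stat, 2))  # compare with a chi-square table on bins - 1 = 9 df
```

With seed 123456 the first value produced by this recurrence agrees with the first SAS value shown above (0.73902) to the printed precision.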
[Figure: two histograms of generated values over the interval [0, 1].]

Testing Randomness
• Generate two sets and plot against each other
• Might see correlation in higher dimensions
• Plot X_i versus X_{i+k} for serial correlation
[Figure: scatter plots of paired draws (x versus y, x1 versus y2) over the unit square.]

Linear Congruential Generator
• The Good
  – Fast
  – Period of up to m random numbers
• The Bad
  – Sequential correlation
    • Plots in more than 1 dimension do not fill in the space uniformly, but tend to form bands
    • Not cryptographically secure
  – Selection of the modulus m, the multiplier, and the increment c is important

Linear Congruential Generator
• Good “magic” numbers for the linear congruential method:
  – a = 16,807, c = 0, M = 2,147,483,647 (that is, 2^31 − 1)

Overflow Method for integers
\[ I_{j+1} = a I_j + c \]
• Multiply two 32-bit numbers to get a 64-bit integer that cannot be represented in 32-bit space.
• The low-order 32 bits remain after the overflow.
• Divide by 2^32 to get floating point values between 0 and 1.
• Very Fast

Blum, Blum, Shub
\[ X_{n+1} = X_n^2 \bmod M, \quad M = pq, \]
where p and q are large primes
• Very slow
  – Not suited to simulation
• Passes all tests
• Cryptographically secure

Mersenne Twister
• By Matsumoto and Nishimura (1997)
• Caused a great deal of excitement in 1997.
• Good statistical properties
• Not good for cryptography
• SAS IML RANDGEN function
• Default technique for R runif()

Mersenne Twister
• I’m just going to give you the flavor of it
• It’s a bit shifting algorithm
    32 bit word: 0 1 0 0 0 1 … 1 1

Mersenne Twister
• XOR
  – Logical bitwise comparison function
  – Compares two bits
    • If they are different, value is 1
    • If they are the same, value is zero

MATLAB:
    >> a=[0 0 1 1]
    a =
         0     0     1     1
    >> b=[0 1 1 0]
    b =
         0     1     1     0
    >> c=xor(a,b)
    c =
         0     1     0     1

R:
    > a<-c(0, 0, 1, 1)
    > b<-c(0, 1, 1, 0)
    > c<-xor(a,b)
    > c
    [1] FALSE  TRUE FALSE  TRUE
    > as.integer(c)
    [1] 0 1 0 1

Mersenne Twister
• Bit shifting algorithm
  – Use XOR function to flip values
    32 bit word: 1 1 0 0 0 XOR 1 … 1 1

Mersenne Twister
• Use 624 32-bit words to make one 19937-bit word (623*32 + 1)
  – XOR flip function in each 32-bit word
[Diagram: a 32 bit word with an XOR applied to it, with arrows labeled "From last word" and "To next word".]

Mersenne Twister
[Figure from: John Savard’s Cryptology Page, http://www.quadibloc.com]

Mersenne Twister
• By Matsumoto and Nishimura (1997)
• Mersenne Prime Numbers (powers of 2, minus 1) give the period length: 2^19937 − 1 for 32-bit numbers
• Free C source code
• Fast
• Passes all randomness smell tests
• Not cryptographically secure

SAS:
    proc iml; /* begin IML session */
    r = j(10,1,.);
    call randgen(r,'uniform');
    print r;
    quit;

SAS output:
    r
    0.0151013
    0.5743561
    0.5829185
    0.6437729
    0.1823678
    0.3977417
    0.476881
    0.9845982
    0.3211301
    0.9623223

R:
    > RNGkind()
    [1] "Mersenne-Twister" "Inversion"
    > r=matrix(runif(10), 10,1)
    > r
                [,1]
     [1,] 0.14645262
     [2,] 0.04558767
     [3,] 0.79254901
     [4,] 0.57810786
     [5,] 0.57831079
     [6,] 0.30258424
     [7,] 0.08682622
     [8,] 0.77980499
     [9,] 0.34161593
    [10,] 0.98705945

Grabbing a Seed from the System Clock (SAS)
Both R and SAS automatically grab a seed value from the system clock at first use, unless you call set.seed (in R) or randseed (in SAS) to set a
specific starting point.

SAS:
    proc iml; /* begin IML session */
    call randseed(12345);
    r = j(10,1,.);
    call randgen(r,'uniform');
    print r;
    quit;

SAS output:
    r
    0.5832971
    0.9936254
    0.5878877
    0.8574689
    0.8246889
    0.2805668
    0.6473969
    0.3819192
    0.4489572
    0.8757847

R:
    > set.seed(12345)
    > r=matrix(runif(10), 10,1)
    > r
               [,1]
     [1,] 0.7209039
     [2,] 0.8757732
     [3,] 0.7609823
     [4,] 0.8861246
     [5,] 0.4564810
     [6,] 0.1663718
     [7,] 0.3250954
     [8,] 0.5092243
     [9,] 0.7277053
    [10,] 0.9897369

Obtaining Random Numbers from Specific Distributions
• Inverse Probability Transform Methods
• Rejection Methods
• Mixed Rejection and Transform
• Methods for Correlated Random Numbers

Obtaining Random Numbers from Specific Distributions
• Inverse Probability Transform methods
  – Let X be a random variable described by CDF F(x)
  – We wish to generate values of X distributed according to F(x).
  – Given a continuous Uniform Random Variable U in [0,1], the Random Variable X = F^{-1}(U) has CDF F.
\[ F^{-1}(u) = \inf\{x \mid F(x) \ge u\}, \quad 0 < u < 1 \]
What this means is:
• Solve for X in the CDF, so that when you plug in U for F, you get a random number from that specific distribution.

Example: Exponential Distribution
\[ f(x) = \lambda e^{-\lambda x}\, I_{(0,\infty)}(x), \qquad F(x) = \left(1 - e^{-\lambda x}\right) I_{(0,\infty)}(x) \]
Let
\[ y = 1 - e^{-\lambda x} \]
Solving for x:
\[ x = -\frac{\log(1-y)}{\lambda} \]
So,
\[ F^{-1}(y) = -\frac{\log(1-y)}{\lambda}, \]
which means that \( -\log(1-U)/\lambda \) is an rv distributed as Exponential(\lambda).

SAS:
    proc iml; /* begin IML session */
    u = j(1000,1,.);
    call randgen(u,'uniform');
    exrand=-log(1-u)/.04;      /* inverse transform, rate 0.04 */
    tbl=u||exrand;
    print tbl;
    varnames={"u","erand"};
    create erand from tbl [colname=varnames];
    append from tbl;
    quit;
    proc means data=erand;
    var u erand;
    run;
    title 'Analysis Exponential RVs';
    proc univariate data=erand noprint;
    histogram erand / midpoints=5 to 205 by 10 exp;
    run;

SAS output (first ten rows of tbl):
    u            exrand
    0.115794      3.0766303
    0.754043     35.064963
    0.157732      4.2914261
    0.0431113     1.1017045
    0.1086405     2.8751851
    0.2632565     7.6378865
    0.9448316    72.434124
    0.3589581    11.116512
    0.7109185    31.026164
    0.4665676    15.710572

The MEANS Procedure
    Variable      N       Mean    Std Dev    Minimum     Maximum
    u          1000     0.4819     0.2877     0.0010      0.9996
    erand      1000    23.4993    23.7694     0.0240    194.5330

R:
    > r=matrix(runif(10000), 10000,1)
    > exrand=-log(1-r)/.04
    > hist(exrand, freq = FALSE)
    > mean(exrand)
    [1] 24.55222
    > 1/mean(exrand)
    [1] 0.04072951

[Figure (R): Histogram of exrand; x-axis exrand (0 to 250), y-axis Density (0.000 to 0.020).]

Weibull Survival
Survival Time:
\[ S(t) = U = \exp\left(-\lambda t^{\gamma}\right) \]
Inverse Prob Transform:
\[ t = \left(\frac{-\ln U}{\lambda}\right)^{1/\gamma}, \qquad \gamma = 1.5, \quad \lambda = 0.001 \]

SAS:
    proc iml; /* begin IML session */
    u = j(2000,1,.);
    call randgen(u,'uniform');
    wrand=(-log(1-u)/.001)##(1/1.5);   /* ## is the power operator */
    tbl=u||wrand;
    print tbl;
    varnames={"u","weibrand"};
    create wrand from tbl [colname=varnames];
    append from tbl;
    quit;
    proc means data=wrand;
    var u weibrand;
    run;
    title 'Analysis of Weibull RVs';
    proc univariate data=wrand noprint;
    histogram weibrand / midpoints=5 to 205 by 10 weibull;
    run;

[Figure: SAS output (image not recovered).]

Weibull Survival (R)

R:
    > r=matrix(runif(10000), 10000,1)
    > wrand=(-log(1-r)/.001)^(1/1.5)
    > hist(wrand, freq = FALSE, main = paste("Histogram of Survival Times"), breaks=50, xlab = "Survival Time")

[Figure (R): Histogram of Survival Times; x-axis Survival Time.]

SAS:
    proc lifetest data=Work.Wrand method=pl OUTSURV=work._surv;
    time WEIBRAND * CENS (0);
    run;
    quit;
    goptions reset=all device=WIN;
    data work._surv;
    set work._surv;
    if survival > 0 then _lsurv =
    -log(survival);
    if _lsurv > 0 then _llsurv = log(_lsurv);
    run;
    ** Survival plots **;
    goptions reset=symbol;
    goptions ftext=SWISS ctext=BLACK htext=1 cells;
    proc gplot data=work._surv ;
    label weibrand = 'Survival Time';
    axis2 minor=none major=(number=6) label=(angle=90 'Survival Distribution Function');
    symbol1 i=stepj c=BLUE l=1 width=1;
    plot survival * weibrand=1 / description="SDF of weibrand" frame cframe=CXF7E1C2 caxis=BLACK vaxis=axis2 hminor=0 name='SDF';
    run;
    symbol1 i=join c=BLUE l=1 width=1;
    quit;
    goptions ftext= ctext= htext= reset=symbol;

[Figure: SAS output (image not recovered).]

R:
    > r=matrix(runif(1000), 1000,1)
    > wrand=(-log(1-r)/.001)^(1/1.5)
    > event=wrand<=200
    > wrand2=wrand*(event)+200*(1-event)
    > fit <- survfit(Surv(wrand2, event) ~ 1, data = aml)
    > plot(fit, lty = 2:3,xlab = "Days", ylab="Survival")

[Figure (R): survival curve; x-axis Days, y-axis Survival.]

Simulating Weibull Regression Data, with Proportional Hazards
Survival Time:
\[ S(t) = P = \exp\left(-\lambda t^{\gamma}\right), \qquad \lambda = \exp(\beta x) \]
Inverse Prob Transform:
\[ t = \left(\frac{-\ln P}{\lambda}\right)^{1/\gamma} \]
\[ \gamma = 1.5, \qquad \beta_0 = \ln(0.001) \approx -6.91, \qquad HR(\text{Drug}) = 0.5, \quad \beta_1 = \ln(0.5) \approx -0.69 \]

SAS:
    proc iml; /* begin IML session */
    u = j(400,1,.);
    d = j(200,1,0) // j(200,1,1);   /* data: drug treatment (0,1) */
    call randgen(u,'uniform');
    /* log(.001) is the constant value; -0.69*d is drug effect * drug */
    wrand=(-log(1-u)/exp(log(.001)-0.69*d))##(1/1.5);
    c = wrand<=200;                 /* 1 = event observed, 0 = censored */
    wrand=wrand#c + 200*(1-c);      /* element-wise multiply: censored times are set to 200 */
    tbl=u || wrand || d || c ;
    print tbl;
    varnames={"u","weibrand","treat", "cens"};
    create wrand from tbl [colname=varnames];
    append from tbl;
    quit;

SAS:
    options pageno=1;
    proc lifetest data=Work.Wrand method=pl OUTSURV=work._surv;
    time WEIBRAND * CENS (0);
    strata TREAT;
    run;
    quit;
    goptions reset=all device=WIN;
    data work._surv;
    set work._surv;
    if survival > 0 then _lsurv = -log(survival);
    if _lsurv > 0 then _llsurv = log(_lsurv);
    run;
    ** Survival plots **;
    title;
    footnote;
    goptions reset=symbol;
    goptions ftext=SWISS ctext=BLACK htext=1 cells;
    proc gplot data=work._surv ;
    label weibrand = 'Survival Time';
    axis2 minor=none major=(number=6) label=(angle=90 'Survival Distribution Function');
    symbol1 i=stepj l=1 width=1;
    symbol2 i=stepj l=2
    width=1;
    symbol3 i=stepj l=3 width=1;
    plot survival * weibrand = treat / description="SDF of weibrand by treat" frame cframe=CXF7E1C2 caxis=BLACK vaxis=axis2 hminor=0 name='SDF';
    run;
    symbol1 i=join l=1 width=1;
    symbol2 i=join l=2 width=1;
    symbol3 i=join l=3 width=1;
    quit;

[Figure: SAS output (image not recovered).]

SAS:
    *** Proportional Hazards Models *** ;
    options pageno=1;
    proc phreg data=Work.Wrand;
    model WEIBRAND * CENS (0) = TREAT;
    run;
    quit;

SAS output:
    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      202.5356     1        <.0001
    Score                 209.0491     1        <.0001
    Wald                  201.2807     1        <.0001

    Analysis of Maximum Likelihood Estimates
                     Parameter    Standard                                Hazard
    Parameter   DF    Estimate       Error    Chi-Square    Pr > ChiSq     Ratio
    treat        1    -0.70413     0.04963      201.2807        <.0001     0.495

Simulating Weibull Regression Data, with Proportional Hazards (R)
Survival Time:
\[ S(t) = P = \exp\left(-\lambda t^{\gamma}\right), \qquad \lambda = \exp(\beta x) \]
Inverse Prob Transform:
\[ t = \left(\frac{-\ln P}{\lambda}\right)^{1/\gamma} \]
\[ \gamma = 1.5, \qquad \beta_0 = \ln(0.001) \approx -6.91, \qquad HR(\text{Drug}) = 0.5, \quad \beta_1 = \ln(0.5) \approx -0.69 \]

R:
    r=matrix(runif(400), 400,1)
    # data: drug treatment (0,1)
    drug=rbind(matrix(1,200,1),matrix(0,200,1))
    # log(.001) is the constant value; -0.69*drug is drug effect * drug
    wrand=(-log(1-r)/exp(log(.001)-0.69*drug))^(1/1.5)
    event = wrand<=200; wrand=wrand*event + 200*(1-event)
    survreg(Surv(wrand, event) ~ drug, dist='weibull', model=TRUE, scale=1)

R output:
    Call:
    survreg(formula = Surv(wrand, event) ~ drug, dist = "weibull", scale = 1, model = TRUE)

    Coefficients:
    (Intercept)        drug
      4.6042014   0.5978962

    Scale fixed at 1

    Loglik(model)= -1918.1   Loglik(intercept only)= -1932.6
    Chisq= 29.06 on 1 degrees of freedom, p= 7e-08
    n= 400

R (package survival):
    lsurv2 <- survfit(Surv(wrand, event) ~ drug, aml, type='fleming')
    plot(lsurv2, lty=2:3,xlab = "Days", ylab="Survival")

[Figure (R, package survival): survival curves by drug; x-axis Days, y-axis Survival.]

R (package eha):
    > enter=matrix(0,400,1)
    > fit <- phreg(Surv(enter, wrand, event) ~ drug)
    > fit
    Call:
    phreg(formula = Surv(enter, wrand, event) ~ drug)

    Covariate      W.mean      Coef  Exp(Coef)  se(Coef)   Wald p
    drug            0.586    -0.731      0.481     0.113    0.000

    log(scale)                4.663    105.903     0.050    0.000
    log(shape)                0.402      1.495     0.047    0.000

    Events                    327
    Total time at risk        44359
    Max. log. likelihood      -1886.9
    LR test statistic         42.4
    Degrees of freedom        1
    Overall p-value           7.38224e-11
    > plot(fit, fn="sur")

[Figure (R, package eha): fitted survival function.]

Generating Numbers from Specific Distributions
• Rejection Method
  – Fast
  – Good for Count Models
  – Good when you cannot find F^{-1}, but have f(x)
  – Generally Use Pairs of Random Numbers
  – Just like playing the game “Battleship”

The Rejection Method is Like Playing the Game Battleship

Rejection
• Choose pairs of uniform random numbers
  – x_U between Xmin and Xmax
  – y_U between Ymin and Ymax
• Reject x_U if y_U > f(x) at x_U

Rejection
[Figure: the rectangle from Xmin to Xmax and Ymin to Ymax containing the curve f(x); points under the curve are hits, points above it are misses.]
Sample the area (two dimensions) containing the probability distribution or density function uniformly.

Rejection
• Simple version becomes inefficient if the rejection area is large.
[Figure (Matlab): Binomial Count, p=0.2, Trials=20; 239 In, 761 Rejected; target g(x) under a flat envelope with a large dead zone; x-axis Count, y-axis Frequency.]

Rejection
• Can be made more efficient by uniform sampling over a smaller target area. The trick is to sample uniformly over the smaller area.
[Figure: target g(x) with a smaller dead zone.]

First, define a "dominating function" f(x), and the corresponding integral or Cumulative Distribution F(x). F(x) need not be normalized.
[Figure: target g(x) under the dominating function f(x), with a smaller dead zone.]

For a linear dominating function:
\[ f(x) = a - bx, \qquad 0 \le x \le \frac{a}{b} \]
\[ F(x) = ax - \frac{b}{2}x^2, \qquad F\!\left(\frac{a}{b}\right) = \frac{a^2}{2b}, \qquad \frac{a}{b} = x_{Max} \]

Rejection
• Choose x_U based on the inverse transform of the integrated dominating function F(x).
  – Choose a uniform random number U_1 in the range \( 0 \le U_1 \le a^2/(2b) \).
  – Calculate x by setting F(x) = U_1 and solving (the quadratic) for x.
[Figure: g(x) under f(x), with the chosen x_U marked.]

Rejection
• Evaluate f(x), choose a second uniform random number U2 between 0 and f(x).
• Reject if U2 > g(x)
[Figure: g(x) under f(x), with x_U marked; 315 In, 685 Rejected; x-axis Count, y-axis Frequency.]

Weibull Function?
\[ f(x) = \frac{\gamma}{\eta}\left(\frac{x}{\eta}\right)^{\gamma-1} \exp\left[-\left(\frac{x}{\eta}\right)^{\gamma}\right], \qquad F(x) = 1 - \exp\left[-\left(\frac{x}{\eta}\right)^{\gamma}\right] \]
\[ x = F^{-1}(u) = \eta\left(-\ln u\right)^{1/\gamma} \quad \text{(using } u \text{ in place of } 1-u \text{, which is also uniform)} \]
Dominating function: f(x) times 2, with \eta = 6.5, \gamma = 1.8
[Figure: 574 In, 426 Rejected; x-axis Count, y-axis Frequency.]

Binomial Distribution
(Bernoulli Trials are the simplest example of the rejection method.)
Probability Pr(X=1): p

SAS:
    proc iml; /* begin IML session */
    r = j(10,1,.);
    call randgen(r,'uniform');
    b=r>0.5;
    print b;
    quit;

SAS output:
    b
    1
    0
    1
    1
    0
    0
    0
    0
    0
    1

R:
    > r=matrix(runif(10), 10,1)
    > b=r<=0.5
    > cbind(r,b)
               [,1] [,2]
     [1,] 0.4919652    1
     [2,] 0.5088624    0
     [3,] 0.5955355    0
     [4,] 0.5243394    0
     [5,] 0.5923056    0
     [6,] 0.1610980    1
     [7,] 0.9663659    0
     [8,] 0.2548106    1
     [9,] 0.4582953    1
    [10,] 0.1170421    1

• But then, you never have just one value of “p” for your Bernoulli Trials

Simulating Outcomes from a Logistic Model
• Placebo Controlled Drug Trial
  – 25% Success for Placebo
  – Odds Ratio of 2.0 for Treatment
• Two different success probabilities, based on the logistic model

Logistic Model
Placebo:
\[ \frac{\exp(\beta_0)}{1+\exp(\beta_0)} = 0.25, \qquad \beta_0 = -1.0986 \]
Drug (0,1): OR = 2.0, ln(OR) = 0.6931
\[ CDF_{Success} = \frac{\exp(-1.0986 + 0.6931 \cdot Drug)}{1+\exp(-1.0986 + 0.6931 \cdot Drug)} \]

SAS:
    proc iml; /* begin IML session */
    u = j(400,1,.);
    /* d: a column of 1s (constant) next to the treatment indicator (0,1) */
    d = j(400,1,1)||(j(200,1,0) // j(200,1,1));
    bta= {-1.0986 , 0.6931};        /* bta1, bta2 */
    call randgen(u,'uniform');
    expit=exp(d*bta)/(1+exp(d*bta));
    outcome=u<=expit;
    tbl=u || d || expit || outcome ;
    varnames={"u","const","treat", "expit","outcome"};
    create erand from tbl [colname=varnames];
    append from tbl;
    quit;
    proc logistic data=Work.Erand DESCEND;
    model OUTCOME = TREAT;
    run;

SAS output:
    The LOGISTIC Procedure
    Analysis of Maximum Likelihood Estimates
                                  Standard          Wald
    Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
    Intercept     1     -1.2367     0.1693       53.3383        <.0001
    treat         1      0.5510     0.2261        5.9402        0.0148

    Odds Ratio Estimates
                   Point          95% Wald
    Effect      Estimate      Confidence Limits
    treat          1.735       1.114      2.702

R:
    r=matrix(runif(400), 400,1)
    drug=rbind(matrix(0,200,1),matrix(1,200,1))
    # d: a column of 1s (constant) next to the drug indicator (0,1)
    d=cbind(matrix(1,400,1), drug)
    parms=matrix(c(-1.0986 , 0.6931),2,1)   # parm1, parm2
    expit=exp(d%*%parms)/(1+exp(d%*%parms))
    outcome=r<=expit

R:
    > drugtrial<-glm(outcome~drug, family = binomial(link="logit"))
    > summary(drugtrial)

    Call:
    glm(formula = outcome ~ drug, family = binomial(link = "logit"))

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.036  -1.036  -0.776   1.326   1.641

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -1.0460     0.1612  -6.488 8.68e-11 ***
    drug          0.7026     0.2158   3.255  0.00113 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 511.49  on 399  degrees of freedom
    Residual deviance: 500.67  on 398  degrees of freedom
    AIC: 504.67

    Number of Fisher Scoring iterations: 4

Normally Distributed Random Numbers
• Inverse transform methods are inefficient for normal random numbers
• Box-Muller Transform
  – z transformation of two random uniform variates [X1, X2 ~ U(0,1)]
    • Random radius, random angle
Get two z variates from two uniform random numbers, X_1 and X_2:
\[ z_1 = r\cos(\theta) = \sqrt{-2\ln(X_1)}\,\cos(2\pi X_2) \]
\[ z_2 = r\sin(\theta) = \sqrt{-2\ln(X_1)}\,\sin(2\pi X_2) \]

Normally Distributed Random Numbers
• Box-Muller Transform
  – X1, X2 specify a position within the unit circle
    • Random angle, random radius
  – Would be more efficient if it did not make calls to trigonometric functions.
• Marsaglia Method
  – Places the Unit Circle within a square, -1 to +1, and samples the square uniformly.
    • Rejects draws that fall outside the circle.
    • But it avoids calls to trig functions.
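The two approaches described above can be sketched in Python with only the standard library. This is an illustrative sketch under stated assumptions, not code from the lecture: the function names `box_muller` and `marsaglia_polar` are hypothetical, `1 - rng.random()` is used so the logarithm never sees zero, and the polar version uses the usual formulation s = x1² + x2² with each coordinate scaled by sqrt(−2 ln(s)/s).

```python
import math
import random

def box_muller(rng):
    """Two standard normal variates from two uniforms, using sine and cosine."""
    x1 = 1.0 - rng.random()          # in (0, 1], keeps log() finite
    x2 = rng.random()
    r = math.sqrt(-2.0 * math.log(x1))   # random radius
    theta = 2.0 * math.pi * x2           # random angle
    return r * math.cos(theta), r * math.sin(theta)

def marsaglia_polar(rng):
    """Two standard normal variates by rejection from the unit circle (no trig calls)."""
    while True:
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        s = x1 * x1 + x2 * x2
        if 0.0 < s < 1.0:            # reject points outside the circle
            scale = math.sqrt(-2.0 * math.log(s) / s)
            return x1 * scale, x2 * scale

rng = random.Random(12345)
bm = [z for _ in range(5000) for z in box_muller(rng)]
mp = [z for _ in range(5000) for z in marsaglia_polar(rng)]
for name, z in (("Box-Muller", bm), ("Marsaglia", mp)):
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / (len(z) - 1)
    print(name, round(mean, 3), round(var, 3))
```

Both samples should show a mean near 0 and a variance near 1; the polar method discards roughly a fifth of its uniform pairs (the corners of the square outside the circle) in exchange for avoiding the trig calls.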
\[ s = X_1^2 + X_2^2 < 1 \]
\[ z_1 = X_1\sqrt{\frac{-2\ln(s)}{s}}, \qquad z_2 = X_2\sqrt{\frac{-2\ln(s)}{s}} \]

Generating Numbers from Specific Distributions
• Normal, Using CLT (quick & dirty)
  – Sum several iterations of u
  – Standardize
    • Recall that Var(u) = 1/12
\[ X = \sum_{i=1}^{12} u_i - 6 \]

Correlated Multivariate Random Numbers
• Simulating panel data, repeated measures
• Mixture distributions

Generating Multivariate Normal Random Numbers
Desired Covariance Matrix V
V = R'R, where R is the Cholesky Decomposition of V
Begin with independent standard normal RVs, Z ~ N(0,1)
Correlated (Multivariate) Normal RVs:
\[ X = R'Z + \mu \]

Generating Multivariate Normal Random Numbers

SAS:
    proc iml; /* begin IML session */
    rmat={1 .3 .2 .1, .3 1 .3 .2, .2 .3 1 .3 , .1 .2 .3 1};
    sigvec={53 36 12 47};
    cvmat=rmat#(sigvec`*sigvec);
    upr=half(cvmat);            /* upper-triangular Cholesky root */
    print rmat;
    print sigvec;
    print cvmat;
    print upr;
    /* Let’s be wasteful */
    r1 = j(1000,4,.);
    r2 = j(1000,4,.);
    call randgen(r1,'uniform');
    call randgen(r2,'uniform');
    pi= 4*atan(1);
    print pi;
    z1=sqrt(-2*log(r1))#cos(2*pi*r2);
    z1=z1*upr;
    varnames={"x1","x2","x3","x4"};
    create nrand from z1 [colname=varnames];
    append from z1;
    quit;
    proc corr data=work.nrand pearson;
    var x1 x2 x3 x4;
    run;

SAS output:
    rmat
      1     0.3    0.2    0.1
      0.3   1      0.3    0.2
      0.2   0.3    1      0.3
      0.1   0.2    0.3    1

    sigvec
      53    36    12    47

    cvmat
      2809     572.4    127.2    249.1
       572.4  1296      129.6    338.4
       127.2   129.6    144      169.2
       249.1   338.4    169.2   2209

    upr
      53    10.8         2.4          4.7
       0    34.341811    3.0190603    8.3757958
       0     0          11.36333     11.672016
       0     0           0           44.503035

    The CORR Procedure
    4 Variables: x1 x2 x3 x4

    Simple Statistics
    Variable      N        Mean     Std Dev           Sum       Minimum      Maximum
    x1         1000    -0.86090    51.78291    -860.89502    -167.70178    157.51299
    x2         1000    -0.21592    36.41244    -215.92386    -122.58068    120.05335
    x3         1000    -0.06176    11.60953     -61.75755     -37.09589     43.83908
    x4         1000     0.46483    46.63351     464.82762    -152.65527    143.41509

    Pearson Correlation Coefficients, N = 1000
    Prob > |r| under H0: Rho=0
               x1         x2         x3         x4
    x1    1.00000    0.30338    0.20341    0.11397
                      <.0001     <.0001     0.0003
    x2    0.30338    1.00000    0.28186    0.24150
           <.0001                <.0001     <.0001
    x3    0.20341    0.28186    1.00000    0.34421
           <.0001     <.0001                <.0001
    x4    0.11397    0.24150    0.34421    1.00000
           0.0003     <.0001     <.0001

Generating Multivariate Normal Random Numbers

R:
    cmat<-rbind(c(1, .3, .2, .1), c(.3, 1, .3, .2), c(.2, .3, 1, .3) , c(.1, .2, .3, 1))
    sigvec=c(53, 36, 12, 47)
    vv=cmat*(sigvec%*%t(sigvec))
    rr=chol(vv)
    r1=matrix(runif(1000), 250,4)
    r2=matrix(runif(1000), 250,4)
    z1=sqrt(-2*log(r1))*cos(2*pi*r2)
    rvs=z1%*%rr

R output:
    > cmat
         [,1] [,2] [,3] [,4]
    [1,]  1.0  0.3  0.2  0.1
    [2,]  0.3  1.0  0.3  0.2
    [3,]  0.2  0.3  1.0  0.3
    [4,]  0.1  0.2  0.3  1.0
    > vv
           [,1]   [,2]  [,3]   [,4]
    [1,] 2809.0  572.4 127.2  249.1
    [2,]  572.4 1296.0 129.6  338.4
    [3,]  127.2  129.6 144.0  169.2
    [4,]  249.1  338.4 169.2 2209.0
    > rr
         [,1]     [,2]      [,3]      [,4]
    [1,]   53 10.80000  2.400000  4.700000
    [2,]    0 34.34181  3.019060  8.375796
    [3,]    0  0.00000 11.363330 11.672016
    [4,]    0  0.00000  0.000000 44.503035
    > cov(rvs)
              [,1]      [,2]     [,3]      [,4]
    [1,] 2832.4200  561.2585 134.0656  533.7351
    [2,]  561.2585 1235.7616 124.2373  382.5441
    [3,]  134.0656  124.2373 127.4132  160.2173
    [4,]  533.7351  382.5441 160.2173 2205.5903
    > cor(rvs)
              [,1]      [,2]      [,3]      [,4]
    [1,] 1.0000000 0.2999969 0.2231676 0.2135426
    [2,] 0.2999969 1.0000000 0.3130961 0.2317137
    [3,] 0.2231676 0.3130961 1.0000000 0.3022317
    [4,] 0.2135426 0.2317137 0.3022317 1.0000000
    > sd(rvs)
    [1] 53.22048 35.15340 11.28774 46.96371

Subject-specific Random Effects
For the i-th subject in the j-th measurement:
\[ y_{ij} = x\beta + e_{ij} + k_i \]
• We have an error term (e_ij) for measurement j in subject i.
• We also have a subject-specific random effect (k_i)

Recipe for Subject-specific Random Effects
• Create subjects for study
  – N
  – Assign treatment, covariates
• Give each subject a random effect
  – Drawn from, say, N(0,V)
• Generate predicted values based on regression + random effects
• Generate outcomes for each repeated measure from the specific distribution

Logistic Model
Placebo:
\[ 0.25 = \frac{\exp(\beta_0)}{1+\exp(\beta_0)}, \qquad \beta_0 = -1.0986 \]
Drug: OR = 2.0, ln(OR) = 0.6931
Time (0,1,2): OR = 1.5, ln(OR) = 0.4055
K_i ~ N(0,1)
\[ CDF_{Success} = \frac{\exp(-1.0986 + 0.6931 \cdot Drug + 0.4055 \cdot Time + K_i)}{1+\exp(-1.0986 + 0.6931 \cdot Drug + 0.4055 \cdot Time + K_i)} \]

SAS:
    proc iml; /* begin IML session */
    u = j(600,1,.);
    d1=j(100,1,0)//j(100,1,1);
    d1=d1//d1//d1;
    id=j(200,1,0);
    do i=1 to 200 by 1;
    id[i,1]=i;
    end;
    id=id//id//id;
    t=j(200,1,0)//j(200,1,1)//j(200,1,2);
    k=j(200,1,.);
    call randgen(k,'normal');
    k=k//k//k;                          /* same k for all three measurements of a subject */
    bta= {-1.0986 , 0.6931,.4055,1};    /* bta1, bta2, bta3, and 1 for the random effect */
    d = j(600,1,1)||d1||t||k;           /* columns: constant, treat, t, k */
    call randgen(u,'uniform');
    expit=exp(d*bta)/(1+exp(d*bta));
    y=u<=expit;
    tbl=id||u || d || expit || y ;
    varnames={"id","u","const","treat","t","k", "expit","outcome"};
    create erand from tbl [colname=varnames];
    append from tbl;
    quit;

Stata output, SAS data:
    . xtlogit outcome treat t, i(id)

    Random-effects logistic regression          Number of obs      =       600
    Group variable: id                          Number of groups   =       200

    Random effects u_i ~ Gaussian               Obs per group: min =         3
                                                               avg =       3.0
                                                               max =         3

                                                Wald chi2(2)       =     11.07
    Log likelihood = -394.754                   Prob > chi2        =    0.0040

    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           treat |   .6023501   .2279107     2.64   0.008     .1556533    1.049047
               t |    .235297   .1123071     2.10   0.036      .015179    .4554149
           _cons |  -.9334699   .2040373    -4.57   0.000    -1.333376   -.5335642
    -------------+----------------------------------------------------------------
        /lnsig2u |  -.1281394   .3971684                     -.9065751    .6502964
    -------------+----------------------------------------------------------------
         sigma_u |   .9379396     .18626                      .6355353    1.384236
             rho |   .2109869   .0661172                      .1093476    .3680594
    ------------------------------------------------------------------------------
    Likelihood-ratio test of rho=0: chibar2(01) = 13.01   Prob >= chibar2 = 0.000

    . di -394.754*(-2)
    789.508

Finally got this to run in SAS. I had forgotten that SAS requires you to sort. Stata does not require sorted data for its mixed models.

SAS:
    proc sort data=erand;
    by id t;
    run;
    proc nlmixed data=erand qpoints=5 ;
    parms b0=0 b1=-.7 b2=.6 sig=0 ;
    theta2 = b0+b1*treat+b2*t+u;
    prb= exp(theta2)/(1+exp(theta2));
    model outcome ~ binary(prb);
    random u ~normal(0,sig) subject=id ;
    run;

SAS output:
    The NLMIXED Procedure
    Fit Statistics
    -2 Log Likelihood            789.5
    AIC (smaller is better)      797.5
    AICC (smaller is better)     797.6
    BIC (smaller is better)      810.7

    Parameter Estimates
                            Standard
    Parameter   Estimate       Error     DF   t Value   Pr > |t|   Alpha      Lower      Upper    Gradient
    b0           -0.9332      0.2039    199     -4.58     <.0001    0.05    -1.3354    -0.5310     0.00039
    b1            0.6021      0.2278    199      2.64     0.0089    0.05     0.1529     1.0513    -0.00007
    b2            0.2353      0.1123    199      2.10     0.0374    0.05    0.01382     0.4567    0.000737
    sig           0.8781      0.3480    199      2.52     0.0124    0.05     0.1917     1.5644    0.000363

We see tiny differences between these and the Stata results, owing to differences in optimization specs.
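The recipe and the logistic model with a subject-specific random effect, as described above, can also be sketched in Python with only the standard library. This is a hedged illustration rather than a port of the SAS program: the helper name `simulate_panel` and its arguments are hypothetical, the coefficients (−1.0986, 0.6931, 0.4055) and K_i ~ N(0,1) are the ones stated on the slides, and no model is fitted here.

```python
import math
import random

def simulate_panel(n_subjects=200, times=(0, 1, 2), seed=12345):
    """Return rows (id, treat, t, k, p, outcome) for a random-intercept logistic model."""
    rng = random.Random(seed)
    b0, b_drug, b_time = -1.0986, 0.6931, 0.4055
    rows = []
    for i in range(1, n_subjects + 1):
        treat = 0 if i <= n_subjects // 2 else 1   # first half placebo, second half drug
        k = rng.gauss(0.0, 1.0)                    # one random effect per subject
        for t in times:
            eta = b0 + b_drug * treat + b_time * t + k
            p = 1.0 / (1.0 + math.exp(-eta))       # same as exp(eta) / (1 + exp(eta))
            outcome = int(rng.random() <= p)       # Bernoulli draw via a uniform
            rows.append((i, treat, t, k, p, outcome))
    return rows

rows = simulate_panel()
for treat in (0, 1):
    ys = [r[5] for r in rows if r[1] == treat]
    print("treat", treat, "success rate", round(sum(ys) / len(ys), 3))
```

The key design point is that k is drawn once per subject and reused for every repeated measure, which is what induces the within-subject correlation that the mixed models above estimate.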
R:
    id=matrix(seq(1:200), 200,1)
    k1=matrix(runif(200), 200,1)
    k2=matrix(runif(200), 200,1)
    k=sqrt(-2*log(k1))*cos(2*pi*k2)     # Box-Muller: one N(0,1) random effect per subject
    id=rbind(id,id,id)
    k=rbind(k,k,k)
    drug =rbind(matrix(0,100,1), matrix(1,100,1))
    drug=rbind(drug,drug,drug)
    d=cbind(matrix(1,600,1),drug,k)     # columns: constant, drug, k
    parms=matrix(c(-1.0986 , 0.6931,1),3,1)   # bta1, bta2, and 1 for the random effect
    expit=exp(d%*%parms)/(1+exp(d%*%parms))
    r=matrix(runif(600), 600,1)         # uniform draws for the 600 outcomes
    outcome=r<=expit

Stata output, R data:
    . xtlogit outcome drug, i(id)

    Random-effects logistic regression          Number of obs      =       600
    Group variable: id                          Number of groups   =       200

    Random effects u_i ~ Gaussian               Obs per group: min =         3
                                                               avg =       3.0
                                                               max =         3

                                                Wald chi2(1)       =     14.55
    Log likelihood = -368.112                   Prob > chi2        =    0.0001

    ------------------------------------------------------------------------------
         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |   .8331252   .2183881     3.81   0.000     .4050924    1.261158
           _cons |  -1.240213   .1708219    -7.26   0.000    -1.575018   -.9054085
    -------------+----------------------------------------------------------------
        /lnsig2u |  -.6325958   .5624138                     -1.734907     .469715
    -------------+----------------------------------------------------------------
         sigma_u |   .7288423   .2049555                      .4200198    1.264729
             rho |   .1390212   .0673177                       .050895    .3271436
    ------------------------------------------------------------------------------
    Likelihood-ratio test of rho=0: chibar2(01) = 5.08   Prob >= chibar2 = 0.012

Got this to run in R, using a mixed effects package called Zelig.

R:
    z.out1 <- zelig(outcome ~ drug + tag(1 | id),data=NULL, model="logit.mixed")

Delia Bailey and Ferdinand Alimadhi. 2007. "logit.mixed: Mixed effects logistic model" in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http://gking.harvard.edu/zelig

R:
    summary(z.out1)
    Generalized linear mixed model fit by the Laplace approximation
    Formula: outcome ~ drug + tag(1 | id)
       AIC   BIC logLik deviance
     743.4 756.6 -368.7    737.4
    Random effects:
     Groups Name        Variance Std.Dev.
     id     (Intercept) 0.39486  0.62838
    Number of obs: 600, groups: id, 200

    Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -1.2172     0.1511  -8.054 8.02e-16 ***
    drug          0.8174     0.2023   4.041 5.32e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Correlation of Fixed Effects:
         (Intr)
    drug -0.747

General Approach to Correlated Multivariate Random Numbers
• “Copulas”
  – allow us to draw correlated random numbers from different distributions
• Random effects in Mixture Models
  – They use CDF probabilities of correlated variables on the inside to map to correlated uniform random numbers on the margins
• Those correlated uniform RVs may be used to marry vastly different distributions.
• Maintain Marginal Distributions

Generating Multivariate Random Numbers
From SAS documentation, a Gaussian Copula
• Independent Normal (N(0,1)) random variables are generated
• These variables are transformed to a correlated set of z-scores by using the Cholesky Decomposition of the covariance matrix.
• These correlated normal RVs are transformed to uniform RVs by using \Phi(z).
• F^{-1}() is used to compute the final sample value

Generating Multivariate Random Numbers

SAS:
    proc iml; /* begin IML session */
    rmat={1 .3 .2 .1, .3 1 .3 .2, .2 .3 1 .3 , .1 .2 .3 1};
    sigvec={1 1 1 1};
    cvmat=rmat#(sigvec`*sigvec); /* # is element-wise multiplication */
    upr=half(cvmat);
    print rmat;
    print sigvec;
    print cvmat;
    print upr;
    r1 = j(1000,4,.);
    r2 = j(1000,4,.);
    call randgen(r1,'uniform');
    call randgen(r2,'uniform');
    pi= 4*atan(1);
    print pi;
    z1=sqrt(-2*log(r1))#cos(2*pi*r2); /* Note – I could have gotten another z here */
    z1=z1*upr;
    z1=cdf('Normal',z1);
    z1=gaminv(z1,3.0); /* Standardized gamma parameter, also the mean */
    varnames={"x1","x2","x3","x4"};
    create nrand from z1 [colname=varnames];
    append from z1;
    quit;
    proc corr data=work.nrand pearson;
    var x1 x2 x3 x4;
    run;

SAS output:
    rmat
      1     0.3    0.2    0.1
      0.3   1      0.3    0.2
      0.2   0.3    1      0.3
      0.1   0.2    0.3    1

    sigvec
      1    1    1    1

    cvmat
      1     0.3    0.2    0.1
      0.3   1      0.3    0.2
      0.2   0.3    1      0.3
      0.1   0.2    0.3    1

    The CORR Procedure
    4 Variables: x1 x2 x3 x4

    Simple Statistics
    Variable      N       Mean    Std Dev     Sum    Minimum     Maximum
    x1         1000    2.96320    1.73566    2963    0.11528    12.19072
    x2         1000    3.01249    1.68236    3012    0.14039    10.20117
    x3         1000    3.00336    1.68803    3003    0.34496    13.72023
    x4         1000    3.08106    1.79858    3081    0.11148    13.25409

    Pearson Correlation Coefficients, N = 1000
    Prob > |r| under H0: Rho=0
               x1         x2         x3         x4
    x1    1.00000    0.25874    0.19005    0.10052
                      <.0001     <.0001     0.0015
    x2    0.25874    1.00000    0.22622    0.13944
           <.0001                <.0001     <.0001
    x3    0.19005    0.22622    1.00000    0.32082
           <.0001     <.0001                <.0001
    x4    0.10052    0.13944    0.32082    1.00000
           0.0015     <.0001     <.0001

Generating Multivariate Random Numbers

R:
    cmat<-rbind(c(1, .4, .4, .4), c(.4, 1, .4, .4), c(.4, .4, 1, .4) , c(.4, .4, .4, 1))
    rr=chol(cmat)
    r1=matrix(runif(1000), 250,4)
    r2=matrix(runif(1000), 250,4)
    z1=rbind(sqrt(-2*log(r1))*cos(2*pi*r2),sqrt(-2*log(r2))*cos(2*pi*r1))
    rvs=z1%*%rr
    cd=pnorm(rvs,mean=0,sd=1)
    g<-qinvgamma(cd,2,3)
    corr(cd)
    [1] 0.4150358
    corr(rvs)
    [1] 0.4188932
    corr(g)
    [1] 0.2337756

Matlab (a little more clear):
    >> U = copularnd('Gaussian',.4,10)
    U =
    (10 x 2 matrix; values in extraction order)
        0.8017 0.3650 0.8104 0.3467 0.6067 0.4743 0.6273 0.9905 0.4427 0.3443
        0.9388 0.2250 0.6253 0.0988 0.6561 0.6723 0.7427 0.8249 0.6925 0.2711
    >> U = copularnd('Gaussian',.4,10000);
    >> corr(U)
    ans =
        1.0000    0.3765
        0.3765    1.0000
    >> X = norminv(U,0,1);
    >> corr(X)
    ans =
        1.0000    0.3901
        0.3901    1.0000
    >> Xg = gaminv(U,2,3);
    >> corr(Xg)
    ans =
        1.0000    0.3721
        0.3721    1.0000

Old Slides

Grabbing a Seed from the System Clock (Stata)
    program define seedset
        local ct =c(current_time)
        local s1=substr("`ct'",7,2)
        local s2=substr("`ct'",4,2)
        local s3=substr("`ct'",2,1)
        global newseed=real("`s1'" +"`s2'" +"`s3'")
        di $newseed
        set seed $newseed
    end

LCG is default for Stata
    . set obs 100
    obs was 0, now 100
    . gen x0=ceil(uniform()*100)
    . gen m=ceil(uniform()*10)
    . gen x1=mod(x0,m)
    . list in 1/10

         +------------+
         | x0   m  x1 |
         |------------|
      1. | 70   2   0 |
      2. | 62   7   6 |
      3. | 92   7   1 |
      4. | 53   1   0 |
      5. | 37   3   1 |
         |------------|
      6. | 78   1   0 |
      7. | 47   2   1 |
      8. | 91   2   1 |
      9. | 98   1   0 |
     10. | 71   9   8 |
         +------------+

Testing Randomness (Stata)
• Correlogram of X_i versus X_{i+k} for serial correlation
    . gen tv=_n
    . tsset tv
            time variable:  tv, 1 to 20000
                    delta:  1 unit
  . corrgram x, lags(40)

                                            -1       0       1 -1       0       1
   LAG       AC       PAC       Q     Prob>Q  [Autocorrelation]  [Partial Autocor]
  -------------------------------------------------------------------------------
     1     0.0026   0.0026   .13551   0.7128          |                  |
     2    -0.0011  -0.0011   .15995   0.9231          |                  |
     3    -0.0004  -0.0004   .16301   0.9833          |                  |
     4    -0.0131  -0.0131   3.5987   0.4630          |                  |
     5    -0.0008  -0.0007   3.6115   0.6066          |                  |
     6     0.0119   0.0118   6.4238   0.3774          |                  |
     7    -0.0060  -0.0061   7.1533   0.4131          |                  |
     8     0.0004   0.0003   7.1571   0.5198          |                  |
     9     0.0057   0.0057    7.815   0.5529          |                  |
    10    -0.0049  -0.0046   8.2893   0.6006          |                  |
    11    -0.0097  -0.0099    10.19   0.5134          |                  |
    12     0.0044   0.0043   10.581   0.5651          |                  |
    13     0.0087   0.0090   12.102   0.5193          |                  |
    14     0.0081   0.0079   13.424   0.4935          |                  |
    15     0.0068   0.0064   14.357   0.4986          |                  |
    16     0.0068   0.0071   15.285   0.5039          |                  |
    17    -0.0029  -0.0025   15.449   0.5632          |                  |

(Stata)

  . seedset
  23491
  . set obs 2000
  obs was 0, now 2000
  . gen P=uniform()
  . gen enum=-ln(P)/.04
  . ci

      Variable |   Obs        Mean    Std. Err.     [95% Conf. Interval]
  -------------+---------------------------------------------------------
             P |  2000    .5030619    .0065298      .4902559    .5158678
          enum |  2000    24.79822     .553316      23.71309    25.88336

(Stata)

  . set obs 200
  obs was 0, now 200
  . gen P=uniform()
  . gen tte=(-ln(P)/0.1)^1.5
  . gen fail=1
  . replace fail=0 if tte>200
  . replace tte=200 if tte>200
  . stset tte, fail(fail)

       failure event:  fail != 0 & fail < .
  obs. time interval:  (0, tte]
   exit on or before:  failure

  ------------------------------------------------------------------
        200  total obs.
          0  exclusions
  ------------------------------------------------------------------
        200  obs. remaining, representing
        188  failures in single record/single failure data
   9019.163  total analysis time at risk, at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =       200

(Stata)

  [Figure: Kaplan-Meier survival estimate. x-axis: analysis time, 0 to 200; y-axis: 0.00 to 1.00]

(Stata)

  . streg, d(w) nohr

           failure _d:  fail
     analysis time _t:  tte

  Weibull regression -- log relative-hazard form

  No. of subjects =          200          Number of obs  =       200
  No. of failures =          188
  Time at risk    =  9019.163067
                                          LR chi2(0)     =      0.00
  Log likelihood  =   -403.39593          Prob > chi2    =         .

  --------------------------------------------------------------------
            _t |   Coef.      SE        z     P>|z|       [95% CI]
  -------------+------------------------------------------------------
         _cons |  -2.245    0.167   -13.47    0.000    -2.572   -1.918
  -------------+------------------------------------------------------
         delta |   0.625    0.036                       0.558    0.701
  --------------------------------------------------------------------

  . gen P=uniform()
  . gen tte=(-ln(P)/(exp(log(0.1)+log(0.5)*drug)))^1.5

(Stata)

  . gen fail=1
  . replace fail=0 if tte>200
  (39 real changes made)
  . replace tte=200 if tte>200
  (39 real changes made)
  . stset tte, fail(fail)

       failure event:  fail != 0 & fail < .
  obs. time interval:  (0, tte]
   exit on or before:  failure

  ------------------------------------------------------------------
        400  total obs.
          0  exclusions
  ------------------------------------------------------------------
        400  obs. remaining, representing
        361  failures in single record/single failure data
   25170.04  total analysis time at risk, at risk from t =         0
                               earliest observed entry t =         0
                                    last observed exit t =       200

(Stata)

  . list drug P tte fail in 1/15

       +-----------------------------------+
       | drug          P        tte   fail |
       |-----------------------------------|
    1. |    1    .842721   6.331312      1 |
    2. |    0   .3839878    29.6119      1 |
    3. |    1   .3483792    96.8484      1 |
    4. |    0   .6035132   11.34804      1 |
    5. |    1   .8460417   6.114305      1 |
       |-----------------------------------|
    6. |    1   .4935982   53.06192      1 |
    7. |    1   .5173908   47.84433      1 |
    8. |    1    .385052   83.39208      1 |
    9. |    0   .8726683   1.589515      1 |
   10. |    0   .0356283   192.5611      1 |
       |-----------------------------------|
   11. |    0   .8018837   3.280757      1 |
   12. |    0   .6059877   11.21039      1 |
   13. |    1   .7919235   10.07838      1 |
   14. |    0   .1920578   67.02081      1 |
   15. |    0   .0819428   125.1301      1 |
       +-----------------------------------+

(Stata)

  [Figure: Kaplan-Meier survival estimates, drug = 0 and drug = 1. x-axis: analysis time, 0 to 200; y-axis: 0.00 to 1.00]

(Stata)

  . streg drug, d(w) nohr

           failure _d:  fail
     analysis time _t:  tte

  Weibull regression -- log relative-hazard form

  No. of subjects =          400          Number of obs  =       400
  No. of failures =          361
  Time at risk    =  25170.03819
                                          LR chi2(1)     =     53.70
  Log likelihood  =   -757.69677          Prob > chi2    =    0.0000

  --------------------------------------------------------------------
            _t |   Coef.      SE        z     P>|z|       [95% CI]
  -------------+------------------------------------------------------
          drug |  -0.788    0.107    -7.34    0.000    -0.998   -0.577
         _cons |  -2.458    0.143   -17.19    0.000    -2.738   -2.177
  -------------+------------------------------------------------------
         delta |   0.706    0.031                       0.648    0.768
  --------------------------------------------------------------------

(Stata)

Generating Multivariate Normal Random Numbers

In Stata, gennorm (webseek to download):

Typing

  . gennorm a b c, corr(.2 .3 .4)

creates a, b, and c with values drawn from a N(0,S) distribution where

        +-            -+
        |  1           |
    S = | .2   1       |
        | .3  .4   1   |
        +-            -+

That is, corr(a,b)=.2, corr(a,c)=.3, and corr(b,c)=.4

(Stata)

Generating Multivariate Normal Random Numbers

In Stata: Example

  . set obs 10000
  obs was 0, now 10000
  . set seed 6819
  . gennorm a b c, corr(.2 .3 .4)
  . summarize a b c

      Variable |    Obs         Mean    Std. Dev.         Min         Max
  -------------+---------------------------------------------------------
             a |  10000    -.0105333     1.005723   -3.694448    3.775433
             b |  10000    -.0042212     1.000254   -3.695302    3.648826
             c |  10000    -.0069625     .9989002   -3.996779    3.606923

  . corr a b c
  (obs=10000)

               |        a        b        c
  -------------+---------------------------
             a |   1.0000
             b |   0.2137   1.0000
             c |   0.3035   0.3952   1.0000

(Stata)

Generating Multivariate Normal Random Numbers

In Stata, drawnorm:

  . clear
  . matrix C=(1, 0.2, 0.3 \ 0.2, 1, 0.4 \ 0.3, 0.4, 1)
  . drawnorm a b c, n(10000) corr(C)
  (obs 10000)
  . summarize a b c

      Variable |    Obs         Mean    Std. Dev.         Min         Max
  -------------+---------------------------------------------------------
             a |  10000    -.0176275     .9920181   -3.701594      3.7838
             b |  10000     .0009005     1.003002   -3.709259    3.518793
             c |  10000    -.0149926     .9925292   -3.716346    4.009713

  . corr a b c
  (obs=10000)

               |        a        b        c
  -------------+---------------------------
             a |   1.0000
             b |   0.1937   1.0000
             c |   0.3051   0.4056   1.0000

Simulating Weibull Regression Data, with Time-Dependency in Drug Effect

Survival Time:
  $S(t) = P = \exp(-\lambda t^{\gamma})$, $\lambda = \exp(x\beta)$
  $x\beta = -2.30 + 0.3 \cdot drug - 0.08 \cdot drug \cdot t$

Inverse Prob Transform: ?????????

How do you solve for t? (Not all answers are in the book.)

Remember Newton's Method?
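A minimal Python sketch of how Newton's method could answer that question. It assumes the shape exponent 0.67 and the coefficient -0.004 on drug*t that appear in the Stata code for this example, and it uses a forward-difference slope with step d. The names `residual` and `solve_time` are hypothetical, and the guard that halves t rather than stepping to a non-positive value is an added assumption, not part of the slides.

```python
import math

GAMMA = 0.67  # shape exponent, taken from the Stata code for this example


def residual(t, p, drug):
    """f(t) = S(t) - P for the time-dependent Weibull model."""
    log_scale = -2.30 + 0.3 * drug - 0.004 * drug * t
    return math.exp(-math.exp(log_scale) * t ** GAMMA) - p


def solve_time(p, drug, t0=1.0, d=1e-4, max_iter=50, tol=1e-12):
    """Newton's method with a forward-difference slope, started at t0."""
    t = t0
    for _ in range(max_iter):
        f = residual(t, p, drug)
        if abs(f) < tol:
            break
        slope = (residual(t + d, p, drug) - f) / d
        t_next = t - f / slope
        # Assumption (not in the slides): halve t instead of stepping to
        # t <= 0, where t ** GAMMA is not a real number.
        t = t_next if t_next > 0 else t / 2
    return t


if __name__ == "__main__":
    for drug in (0, 1):
        t = solve_time(0.5, drug)
        print(drug, t, residual(t, 0.5, drug))
```

For drug = 0 the model has no time-dependence, so the result can be checked against the closed-form inverse (-ln(P)/exp(-2.30))^(1/0.67). For drug = 1 the modelled survival curve stops decreasing at large t, so very small values of P may have no solution and the iteration should not be expected to converge there.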
  $t_0 \rightarrow t_1 \rightarrow \dots$

  $f(t) = \exp\{-\exp(-2.30 + 0.3 \cdot drug - 0.004 \cdot drug \cdot t)\, t^{\gamma}\} - P$   ($\gamma = 0.67$ in the code below)

  $f(t^{*}) = 0$

  $f'(t) \approx \dfrac{f(t+d) - f(t)}{d}$

  $t_1 = t_0 - \dfrac{f(t_0)}{f'(t_0)}$

  clear
  set obs 400
  gen drug=_n>200
  gen double P=uniform()
  gen double t=1
  gen double tpd=t+.0001
  gen double f=exp(-exp(-2.30+0.3*drug-0.004*drug*t)*(t^0.67))-P
  gen double fp=exp(-exp(-2.30+0.3*drug-0.004*drug*tpd)*(tpd^0.67))-P
  gen double slope=(fp-f)/0.0001
  forvalues i=1/50 {
      qui replace f=exp(-exp(-2.30+0.3*drug-0.004*drug*t)*(t^0.67))-P
      qui replace fp=exp(-exp(-2.30+0.3*drug-0.004*drug*tpd)*(tpd^0.67))-P
      qui replace slope=(fp-f)/0.0001
      qui replace t=t-f/slope
      qui replace tpd=t+.0001
  }
(Stata)

Matlab

  >> drug=[zeros(1000,1);ones(1000,1)];
  >> P=rand(2000,1);
  >> cdf0=exp(-1.0986+0.6931*drug)./(1+exp(-1.0986+0.6931*drug));
  >> outcome=P<=cdf0;
  >> b = glmfit(drug,outcome,'binomial')
  b =
     -1.0616
      0.6562

(Stata)

  [Figure: Kaplan-Meier survival estimates, drug = 0 and drug = 1. x-axis: analysis time, 0 to 200; y-axis: 0.00 to 1.00]

  . gen P=uniform()
  . gen cdf0=exp(-1.0986+0.6931*drug)/(1+exp(-1.0986+0.6931*drug))
  . list in 1/10

       +----------------------------+
       | drug          P       cdf0 |
       |----------------------------|
    1. |    0   .2865897   .2500023 |
    2. |    0   .3788754   .2500023 |
    3. |    1   .3597057   .3999916 |
    4. |    1   .7182508   .3999916 |
    5. |    1   .4315197   .3999916 |
       |----------------------------|
    6. |    1   .2963237   .3999916 |
    7. |    1   .7961193   .3999916 |
    8. |    0    .056983   .2500023 |
    9. |    0   .4622037   .2500023 |
   10. |    0   .5336403   .2500023 |
       +----------------------------+

(Stata)

  . gen outcome=P<=cdf0
  . list in 1/10

       +--------------------------------------+
       | drug          P       cdf0   outcome |
       |--------------------------------------|
    1. |    0   .2865897   .2500023         0 |
    2. |    0   .3788754   .2500023         0 |
    3. |    1   .3597057   .3999916         1 |
    4. |    1   .7182508   .3999916         0 |
    5. |    1   .4315197   .3999916         0 |
       |--------------------------------------|
    6. |    1   .2963237   .3999916         1 |
    7. |    1   .7961193   .3999916         0 |
    8. |    0    .056983   .2500023         1 |
    9. |    0   .4622037   .2500023         0 |
   10. |    0   .5336403   .2500023         0 |
       +--------------------------------------+

(Stata)
  . gen outcome=P<=cdf0
  . logistic outcome drug

  Logistic regression                     Number of obs  =      2000
                                          LR chi2(1)     =     58.65
                                          Prob > chi2    =    0.0000
  Log likelihood = -1245.3138             Pseudo R2      =    0.0230

  --------------------------------------------------------------------
       outcome |     OR       SE        z     P>|z|       [95% CI]
  -------------+------------------------------------------------------
          drug |   2.084    0.202     7.57    0.000     1.723    2.519
  --------------------------------------------------------------------
         _cons |  -1.077    0.073   -14.83    0.000    -1.220   -0.935
  --------------------------------------------------------------------
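The same simulate-then-refit check can be sketched in Python. This is only an illustration under stated assumptions: the function names and the seed are hypothetical, and the fit uses the fact that, with a single binary covariate, the logistic maximum-likelihood estimates reduce to the observed log-odds in each arm, so no optimizer is needed.

```python
import math
import random


def simulate_outcomes(n_per_arm, b0=-1.0986, b1=0.6931, seed=12345):
    """Return (drug, outcome) pairs with P(outcome=1) = expit(b0 + b1*drug).

    The seed value is an arbitrary, hypothetical choice for reproducibility.
    """
    rng = random.Random(seed)
    data = []
    for drug in (0, 1):
        p_event = 1.0 / (1.0 + math.exp(-(b0 + b1 * drug)))
        for _ in range(n_per_arm):
            u = rng.random()  # uniform draw on [0, 1)
            data.append((drug, int(u <= p_event)))  # outcome = (P <= cdf0)
    return data


def logit(p):
    return math.log(p / (1.0 - p))


def fit_two_group_logit(data):
    """Closed-form logistic MLE for one binary covariate (log-odds per arm)."""
    rates = []
    for arm in (0, 1):
        outcomes = [y for d, y in data if d == arm]
        rates.append(sum(outcomes) / len(outcomes))
    intercept = logit(rates[0])
    slope = logit(rates[1]) - intercept
    return intercept, slope


if __name__ == "__main__":
    b0_hat, b1_hat = fit_two_group_logit(simulate_outcomes(1000))
    print("intercept:", b0_hat, "log OR:", b1_hat, "OR:", math.exp(b1_hat))
```

With 1000 subjects per arm, as in the Matlab and Stata runs, the estimates should land near the true values -1.0986 and 0.6931 (odds ratio 2), with the same kind of sampling scatter seen in the output above.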