Documentation of EG21 TEXT and related files:
last update: BQ 6/12/1991

Purpose:  fitting multivariate binary regression models that allow
-------   more than one class in each cluster, and a different
          regression for each class and for the dependence between
          and within classes. Allows the choice between GEE1 and GEE2.

Version:  0.5 Beta test version.
-------

Environment:  IBM-VM/CMS
-----------

Necessary files:  To run the program the following files are needed
---------------
     EGEE    EXEC   A : the exec that executes the program
     EG21    TEXT   A : the executable code
     QMATRIX TXTLIB A : library needed by the program
     XXX     YYY    ? : the data file
     ZZZ     EGEE   A : the control file
     ( XXX, YYY and ZZZ are specified by the user )

Output:
------
     ZZZ  OUTPUT A : the output
     FILE STDERR A : reserved for debugging purposes.
     FILE MFILE  A : reserved for debugging purposes.
     FILE PROBEF A : reserved for debugging purposes.

     The user should not be concerned with the three debugging files.
     Future versions of the program will not generate them.

To run the program:  From the CMS command line issue:
-------------------
     EGEE ZZZ

     assuming the control file is named ZZZ EGEE A. Output will go to
     ZZZ OUTPUT A. If ZZZ OUTPUT A exists it will be replaced (not
     appended to) by the new file.

Data file format:  Free format with one record per observation.
----------------
     The variables are:
          Cluster id
          The class number
          The response variable (y = 0/1)
          The regressor(s)

     The data file may have RECFM F or V. There is no restriction on
     the LRECL.

Control file format:
-------------------
     The control file can have RECFM F or V, maximum LRECL is 133.

     The first record must contain four zeros as follows:
          0 0 0 0
     (this will not be needed in the final version of the program)

     The second and third records are titles that will be printed on
     the output file.

     The fourth record is the data file name.

     The fifth record contains an integer, the number of classes.
     See below for the maximum allowed.

     The sixth record contains an integer, the number of variables
     that follow the response in the data file. It is not necessary
     that all these variables be used in the regressions.

     The seventh record contains two integers, i1 and i2:
          i1 = number of parameters.
          i2 = number of parameters for main effects.
     Naturally i1 is greater than or equal to i2. (not checked)
     It must be arranged so that the odds-ratio parameters are the
     last in the parameter vector. (not checked)

     The eighth record contains a real number, the convergence
     criterion. Iteration stops when the sum of the absolute changes
     in all parameters between two iterations is less than that
     number or the maximum number of iterations is reached, whichever
     occurs first.

     The ninth record contains an integer, the maximum number of
     iterations.

     The tenth record contains an integer i1, say. If i1 = 1 then the
     current estimates of the parameters will be printed at each
     iteration.

     The eleventh record contains an integer, i.
          i = 1 : GEE1
          i = 2 : GEE2

     The twelfth record contains an integer i1, say.
          If i1 = 1 then the Zhao and Prentice formulae for third and
          fourth order moments will be used.
          If i1 = 2 then the exact solution will be used for these
          moments.

     The thirteenth record is ignored.

     The fourteenth and following records, as many as there are
     parameters, specify labels for the parameters. These will be
     used to label the output. Only the first 16 characters will be
     used.

     The following record is ignored.

     The following record(s) contain initial values for the
     parameters. These may span one or more records.

     The following record is ignored.

     The following records specify the regressions. If the number of
     classes is C, then C + C + {C * (C-1) / 2} records are required:
          C specifications for the regressions for each class.
          C specifications for the regressions for the within class
            odds ratios.
          C * (C-1) / 2 specifications for the regressions for the
            between class odds ratios.
     examples:
          C    number of specifications
          1     2
          2     5
          3     9
          4    14
          5    20
          6    27
          7    35
          8    44

     Each regression is specified by a sequence of integers as
     follows:
          i1 i2 i3 i4 i5 i6 ...
     where i1 and i2 are class numbers. To specify the regression for
     the main effects for a class set i2 = 0.
          i3, i5, ... are the parameter indices.
          i4, i6, ... are the regressor indices.
     If B is the regression parameter and x is the vector of
     regressors in the data file then the regression will be
          B(i3)*x(i4) + B(i5)*x(i6) + ...
     If i3 = 0 then that regression is set to 0.

     It must be arranged so that the odds-ratio parameters are the
     last in the parameter vector. (not checked)
     Each parameter should appear at least once in the regression
     specifications. The order of the specifications is not important.

     Any following records will be ignored.

     Note: extra text following numbers on the control file is
     allowed except on the following:
          record number 4: the data file name
          the record(s) specifying the initial parameter values
          the record(s) specifying the regressions.
     This is demonstrated by the example control file.

Current program limits:
----------------------
     maximum number of classes = 12
     maximum cluster size = 12
     maximum number of observations: The sum of n + n * (n-1) / 2
          over all clusters must be <= 16000, where n is the cluster
          size.
     maximum number of parameters = 60
     maximum number of potential regressors in the input file = 64
     maximum number of iterations that could be specified = 100

Technical notes:
---------------
     The values of the regressors used in the regression for the
     within and between class associations should be the same for all
     members of any given cluster. The program currently uses the
     values from the last member in each cluster. Don't rely on this
     "feature". This will change in future versions of the program.

     The program does a fair amount of checking on the control file
     and the data file. However it is not an exhaustive check.
     The model specification is very flexible. Completely ridiculous
     models can be specified.
     The program has no way of recognizing these. Care is needed here.

Example control file:
--------------------
0 0 0 0    (reserved, must be as shown)
-- Title1: example control file --
-- Title2: --
COPD1 DATA A
2      = number of classes. Suppose class 1 = P, class 2 = S
6      = dim (x): x1 x2 x3 x4 x5 x6
9 6    = total=9, main effects=6
0.001  = convergence criterion
50     = maximum number of iterations
1      = print current estimates each iteration. 1=yes, 0=no
2      = 1=GEE1, 2=GEE2
2      = exact 2=exact 1=Z&P approx.
labels for beta: these will appear on the output file
1 Intercept
2 Sex (F)
3 Race (B)
4 Age-50
5 Smoker
6 Ex smoker
7 P.P
8 S.S
9 P.S
Initial estimates:
-0.83188 -0.80439 -0.91741 0.03796 1.14924 0.39144 0.93362 0.934 0.934
model specification:
1 0  1 1  2 2  3 3  4 4  5 5  6 6
2 0  1 1  2 2  3 3  4 4  5 5  6 6
1 1  7 1
2 2  8 1
1 2  9 1
-- End of the control File --

The model specified above is:

     Main effects:
     For class 1 (P):
          logit(Pr{Y=1}) = B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5 + B6*x6
     For class 2 (S):
          logit(Pr{Y=1}) = B1*x1 + B2*x2 + B3*x3 + B4*x4 + B5*x5 + B6*x6

     Odds ratios:
     Within class 1 (P.P):            log(odds ratio) = B7*x1
     Within class 2 (S.S):            log(odds ratio) = B8*x1
     Between classes 1 and 2 (P.S):   log(odds ratio) = B9*x1

Suppose that a model with no association between classes 1 and 2 is
required. Then the last specification record of the control file
(1 2 9 1 in the example) should be:
1 2 0
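
Checking a control file by hand:
-------------------------------
     Since the program does not check every aspect of the model
     specification, it can help to check the counts and read back
     each specification record before running. The following Python
     sketch is not part of EG21. The function names are hypothetical,
     and it assumes (from the example control file) that a within
     class record is the one with i1 = i2.

```python
def n_spec_records(c):
    """Records required: C main-effect + C within-class + C*(C-1)/2 between-class."""
    return c + c + c * (c - 1) // 2


def observation_load(cluster_sizes):
    """Sum of n + n*(n-1)/2 over all clusters (documented limit: <= 16000)."""
    return sum(n + n * (n - 1) // 2 for n in cluster_sizes)


def describe_spec(record):
    """Read back one specification record 'i1 i2 i3 i4 i5 i6 ...' as a formula."""
    nums = [int(tok) for tok in record.split()]
    i1, i2, rest = nums[0], nums[1], nums[2:]
    if i2 == 0:
        lhs = f"class {i1}: logit(Pr{{Y=1}})"
    elif i1 == i2:
        # Assumption taken from the example file: i1 = i2 means within class.
        lhs = f"within class {i1}: log(odds ratio)"
    else:
        lhs = f"between classes {i1} and {i2}: log(odds ratio)"
    if rest and rest[0] == 0:
        # i3 = 0 sets the whole regression to 0.
        return f"{lhs} = 0"
    # (i3, i4), (i5, i6), ... are (parameter index, regressor index) pairs.
    terms = [f"B{p}*x{x}" for p, x in zip(rest[0::2], rest[1::2])]
    return f"{lhs} = " + " + ".join(terms)


if __name__ == "__main__":
    print(n_spec_records(2))
    for rec in ["1 0 1 1 2 2 3 3 4 4 5 5 6 6", "1 1 7 1", "1 2 9 1", "1 2 0"]:
        print(describe_spec(rec))
```

     For the example control file, n_spec_records(2) agrees with the
     five specification records shown, and describe_spec reproduces
     the formulae listed under "The model specified above is".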