Merging in SAS

• These slides show alternatives for merging two datasets using the IN= data set option (check in the SAS onlinedoc > “BASE SAS”, “SAS Language Reference: Dictionary” > “Data step options” > “IN=”).
• In the slides, the red data goes into the merged data set. The greyed-out observations are left out.

The perfect merge

Dataset A               Dataset B
ID   V1    V2           ID   V3    V4
1    123   123          1    343   343
2    421   434          2    85    4234
3    129   436          3    325   434
4    122   767          4    763   234
5    232   34           5    229   324
6    534   435          6    554   324
7    343   89           7    884   34
8    324   6787         8    895   342

Not so perfect (if a or b;)

Dataset A (in=a)        Dataset B (in=b)
ID   V1    V2           ID   V3    V4
                        1    343   343
2    421   434          2    85    4234
3    129   436
4    122   767          4    763   234
                        5    229   324
6    534   435          6    554   324
7    343   89
8    324   6787         8    895   342

Goes into the merged data set: every ID from 1 to 8, because each one is in A or in B.

If a=b; (both datasets contribute)

Dataset A (in=a)        Dataset B (in=b)
ID   V1    V2           ID   V3    V4
                        1    343   343
2    421   434          2    85    4234
3    129   436
4    122   767          4    763   234
                        5    229   324
6    534   435          6    554   324
7    343   89
8    324   6787         8    895   342

Goes into the merged data set: IDs 2, 4, 6 and 8 (the same result as "if a and b;"). IDs 1, 3, 5 and 7 are left out.

If a; (must be in dataset A)

Dataset A (in=a)        Dataset B (in=b)
ID   V1    V2           ID   V3    V4
                        1    343   343
2    421   434          2    85    4234
3    129   436          .    .     .
4    122   767          4    763   234
                        5    229   324
6    534   435          6    554   324
7    343   89           .    .     .
8    324   6787         8    895   342

Goes into the merged data set: IDs 2, 3, 4, 6, 7 and 8. V3 and V4 are missing (.) for IDs 3 and 7, which are not in B. IDs 1 and 5 are left out.

If b; (must be in dataset B)

Dataset A (in=a)        Dataset B (in=b)
ID   V1    V2           ID   V3    V4
.    .     .            1    343   343
2    421   434          2    85    4234
3    129   436
4    122   767          4    763   234
.    .     .            5    229   324
6    534   435          6    554   324
7    343   89
8    324   6787         8    895   342

Goes into the merged data set: IDs 1, 2, 4, 5, 6 and 8. V1 and V2 are missing (.) for IDs 1 and 5, which are not in A. IDs 3 and 7 are left out.

Notes

• The examples assume there is a unique identifier. This can be either one variable (for example, CRSP's PERMNO or Compustat's GVKEY) or more than one variable (for example, PERMNO and DATE for a panel dataset).
• Assumption: Both data sets are sorted by the unique identifier(s).
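The slide examples can be reproduced with a short sketch like the one below. It assumes the "not so perfect" versions of Dataset A and Dataset B shown in the slides. The dataset names (set_a, set_b, either, both, in_a, in_b) are hypothetical and chosen only for illustration.

```sas
/* Dataset A from the "not so perfect" slides: IDs 1 and 5 are absent */
data set_a;
  input ID V1 V2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

/* Dataset B from the "not so perfect" slides: IDs 3 and 7 are absent */
data set_b;
  input ID V3 V4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

/* Both datasets must be sorted by the identifier before the merge */
proc sort data=set_a; by ID; run;
proc sort data=set_b; by ID; run;

/* One DATA step writes a separate output dataset per control statement */
data either both in_a in_b;
  merge set_a (in=a) set_b (in=b);
  by ID;
  output either;                  /* like "if a or b;"  */
  if a and b then output both;    /* like "if a and b;" */
  if a then output in_a;          /* like "if a;"       */
  if b then output in_b;          /* like "if b;"       */
run;
```

Under these assumptions, "either" should hold IDs 1 to 8, "both" IDs 2, 4, 6 and 8, "in_a" the six IDs of Dataset A (with V3 and V4 missing for IDs 3 and 7), and "in_b" the six IDs of Dataset B (with V1 and V2 missing for IDs 1 and 5). Writing all four outputs in one step is only a convenience for comparing them; in practice you would normally keep a single control statement.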
Sample code

    proc sort data=yourdata;
      by permno date;
    run;

    proc sort data=otherdata;
      by permno date;
    run;

    data newdata;
      merge yourdata (in=a) otherdata (in=b);
      by permno date;   /* note: the BY variables are in the same order */
                        /* as in the PROC SORT BY statements            */
      /* below this, you write your control statement, ONE of the following: */
      /*   if a;        */
      /*   if b;        */
      /*   if a and b;  */
      /*   if not a;    */
      /*   if not b;    */
      if a and b;       /* shown here as an example; keep only one of them */
    run;

Typical problems

• If both datasets were complete (they both have the same observed units), then the IF statements would be unnecessary; "if a and b" would be equivalent to leaving the statement out altogether.
• If you do not have a BY statement (there is no identifier, but you somehow know that each row of one dataset corresponds to the same row in the other dataset), the datasets are just "glued" side by side.
• Common mishaps: if the BY variables have different formats across datasets (for example, different lengths), SAS will merge the datasets, but will put a WARNING in the log. Another common mishap is to have variables with the same name (that are not the ID) in both datasets: one of them will be overwritten.

References

Good references are
• http://ftp.sas.com/techsup/download/technote/ts644.html
• and a manual called "Combining and modifying SAS data sets: examples", which is in the RC library. It has a lot of examples. Unfortunately, it does not exist in an online version: only the code is available online, without the explanations, which are very good.