SAS merging

Merging in SAS
• These slides show alternatives for merging
two datasets using the IN= data set option
(see the SAS OnlineDoc > “BASE SAS” >
“SAS Language Reference: Dictionary” >
“Data step options” > “IN=”).
• In the slides, the red data goes into the
merged data set. The greyed out
observations are left out.
The perfect merge
Dataset A
ID   V1    V2
1    123   123
2    421   434
3    129   436
4    122   767
5    232   34
6    534   435
7    343   89
8    324   6787

Dataset B
ID   V3    V4
1    343   343
2    85    4234
3    325   434
4    763   234
5    229   324
6    554   324
7    884   34
8    895   342
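A minimal sketch of this case, assuming the two tables above are typed in as WORK datasets named a and b (the names a, b and merged are chosen here for illustration). Because every ID appears in both datasets, a plain match-merge with no IF statement keeps all eight observations.

```sas
/* Dataset A from the slide, already in ID order */
data a;
  input id v1 v2;
  datalines;
1 123 123
2 421 434
3 129 436
4 122 767
5 232 34
6 534 435
7 343 89
8 324 6787
;
run;

/* Dataset B from the slide, already in ID order */
data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
3 325 434
4 763 234
5 229 324
6 554 324
7 884 34
8 895 342
;
run;

/* Match-merge on ID; every ID is in both datasets, so no IF is needed */
data merged;
  merge a b;
  by id;
run;

proc print data=merged noobs;
run;
```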
Not so perfect (if a or b;)
Dataset A (in=a)
ID   V1    V2
2    421   434
3    129   436
4    122   767
6    534   435
7    343   89
8    324   6787

Dataset B (in=b)
ID   V3    V4
1    343   343
2    85    4234
4    763   234
5    229   324
6    554   324
8    895   342
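A sketch of the "if a or b;" case with the incomplete datasets above. The IN= variables are temporary and are not written to the output dataset, so this example copies them into two ordinary variables (from_a and from_b, names made up for illustration) to show which dataset contributed to each row.

```sas
/* Dataset A: IDs 1 and 5 are absent */
data a;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

/* Dataset B: IDs 3 and 7 are absent */
data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

data either;
  merge a (in=a) b (in=b);
  by id;
  if a or b;     /* true for every row of a match-merge: IDs 1 to 8 are kept */
  from_a = a;    /* 1 if dataset A contributed to this ID, else 0 */
  from_b = b;    /* 1 if dataset B contributed to this ID, else 0 */
run;

proc print data=either noobs;
run;
```

IDs 1 and 5 come out with missing V1 and V2, and IDs 3 and 7 with missing V3 and V4.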
If a=b; (both datasets contribute)
Dataset A (in=a)
ID   V1    V2
2    421   434
3    129   436
4    122   767
6    534   435
7    343   89
8    324   6787

Dataset B (in=b)
ID   V3    V4
1    343   343
2    85    4234
4    763   234
5    229   324
6    554   324
8    895   342
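A sketch of the case where both datasets must contribute, using the same two tables. In a match-merge at least one IN= flag equals 1 on every observation, so "if a=b;" selects the same rows as "if a and b;". The output name both is an arbitrary choice.

```sas
data a;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

data both;
  merge a (in=a) b (in=b);
  by id;
  if a and b;    /* keeps only IDs found in both datasets: 2, 4, 6, 8 */
run;

proc print data=both noobs;
run;
```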
If a; (must be in dataset A)
Dataset A (in=a)
ID   V1    V2
2    421   434
3    129   436
4    122   767
6    534   435
7    343   89
8    324   6787

Dataset B (in=b)
ID   V3    V4
1    343   343
2    85    4234
.    .     .
4    763   234
5    229   324
6    554   324
.    .     .
8    895   342
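A sketch of the "if a;" case with the same data: every observation of dataset A is kept, and B's variables are set to missing (.) where B has no matching ID. The output name a_rows is an arbitrary choice.

```sas
data a;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

data a_rows;
  merge a (in=a) b (in=b);
  by id;
  if a;    /* keeps IDs 2, 3, 4, 6, 7, 8; V3 and V4 are missing for 3 and 7 */
run;

proc print data=a_rows noobs;
run;
```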
If b; (must be in dataset B)
Dataset A (in=a)
ID   V1    V2
.    .     .
2    421   434
3    129   436
4    122   767
.    .     .
6    534   435
7    343   89
8    324   6787

Dataset B (in=b)
ID   V3    V4
1    343   343
2    85    4234
4    763   234
5    229   324
6    554   324
8    895   342
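The mirror image, sketched under the same assumptions: with "if b;" every observation of dataset B is kept, and A's variables are missing where A has no matching ID. The output name b_rows is an arbitrary choice.

```sas
data a;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

data b_rows;
  merge a (in=a) b (in=b);
  by id;
  if b;    /* keeps IDs 1, 2, 4, 5, 6, 8; V1 and V2 are missing for 1 and 5 */
run;

proc print data=b_rows noobs;
run;
```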
Notes
• The examples assume there is a unique
identifier. This can be either one variable
(e.g., CRSP's PERMNO or Compustat's
GVKEY) or more than one variable (for
example, PERMNO and DATE for a panel
dataset).
• Assumption: Both data sets are sorted by
the unique identifier(s).
Sample code
proc sort data=yourdata; by permno date; run;
proc sort data=otherdata; by permno date; run;

data newdata;
  merge yourdata (in=a) otherdata (in=b);
  by permno date;
  /* note: the BY variables are in the same order */
  /* as in the PROC SORT BY statements            */
  /* below this, write your control statement:    */
  /* keep exactly one of the following            */
  if a;              /* must be in yourdata       */
  /* if b;              must be in otherdata      */
  /* if a and b;        must be in both           */
  /* if not a;          only in otherdata         */
  /* if not b;          only in yourdata          */
run;
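The control statements can also be combined with OUTPUT to route matched and unmatched observations to separate datasets in one pass. This is a sketch only, using the incomplete Dataset A and Dataset B from the earlier slides rather than real PERMNO/DATE data. The output names both, only_a and only_b are made up for illustration.

```sas
data a;
  input id v1 v2;
  datalines;
2 421 434
3 129 436
4 122 767
6 534 435
7 343 89
8 324 6787
;
run;

data b;
  input id v3 v4;
  datalines;
1 343 343
2 85 4234
4 763 234
5 229 324
6 554 324
8 895 342
;
run;

/* One DATA step, three output datasets */
data both only_a only_b;
  merge a (in=a) b (in=b);
  by id;
  if a and b then output both;     /* IDs 2, 4, 6, 8 */
  else if a then output only_a;    /* same rows as "if not b;": IDs 3, 7 */
  else output only_b;              /* same rows as "if not a;": IDs 1, 5 */
run;
```

Checking only_a and only_b is a quick way to see which identifiers failed to match before deciding which IF statement to use.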
Typical problems
• If both datasets were complete (they both have the same
observed units), then the IF statements would be
unnecessary; "if a and b" would be equivalent to leaving
the statement out altogether.
• If you do not have a BY statement (no identifier; you
somehow know that each row of one dataset corresponds
to the same row in the other dataset), the datasets are
just "glued" side-by-side.
• Common mishaps: the BY variables have different formats
(or lengths) across datasets. SAS will merge the datasets, but will put a
WARNING in the log. Another common mishap is to have
variables with the same name (that are not the ID): one of
them will be overwritten.
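One way to avoid the overwriting problem is the RENAME= data set option on one of the inputs. The sketch below uses two tiny made-up datasets that share a non-ID variable v1; the data values and the new name v1_b are hypothetical and serve only to illustrate the idea.

```sas
/* Hypothetical data: both datasets carry a variable called v1 */
data a;
  input id v1;
  datalines;
1 10
2 20
;
run;

data b;
  input id v1;
  datalines;
1 99
2 98
;
run;

data merged;
  /* rename B's v1 as it is read, so it cannot overwrite A's v1 */
  merge a (in=a) b (in=b rename=(v1=v1_b));
  by id;
  if a and b;
run;

proc print data=merged noobs;
run;
```

The merged dataset then holds both values side by side (v1 from A, v1_b from B) instead of a single v1.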
References
Good references are
• http://ftp.sas.com/techsup/download/technote/ts644.html
• and a manual called "Combining and Modifying
SAS Data Sets: Examples", which is in the RC
library. It has a lot of examples, and the explanations
are very good. Unfortunately, it does not exist in an
online version (only the code is available).