Preventing data loss when combining SAS datasets.

John Cantrell
Univ. North Carolina
Chapel Hill, NC

SAS has powerful tools for dataset combination
- PROC SQL
- SAS data step

SAS data step
Data step methods use SET, MERGE, and BY statements

                            MERGE statement   SET statement
no BY statement             1 to 1 merge      concatenation
followed by BY statement    matched merge     interleaving

The workhorses:
- Matched merge and concatenation.
- data twothree;
    merge two three;
  run;
- Forgetting the BY statement yields a 1 to 1 merge instead of a matched merge.

LOG: errors, warnings & notes
- An unintentional 1 to 1 merge is caught with “options mergenoby=error;”
- Error messages in the LOG help to find mistakes.
- Not all problems yield error messages: the note “repeats of BY values” signals a many-to-many merge.
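
As a hedged illustration of the MERGENOBY= option mentioned above, the sketch below sets it before a merge that is missing its BY statement. It assumes datasets two and three already exist. With the option set to ERROR, SAS should write an error to the LOG and stop the step instead of quietly performing a 1 to 1 merge.

```sas
/* Treat a MERGE without a BY statement as an error */
options mergenoby=error;

data twothree;
  merge two three;   /* BY statement left out on purpose */
run;

/* Put the option back afterwards (NOWARN is assumed to be the site default) */
options mergenoby=nowarn;
```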

LOG does not flag all mistakes
- A serious problem can occur without any message in the LOG.
- For example, data loss by truncation due to unequal variable lengths.

Example: Concatenation
- Concatenation of 2 datasets named two and three. Each input dataset contains:
  - A numeric variable named id.
  - A character variable named var.

Example: input datasets:

dataset two        dataset three
id   var           id   var
1    21            0    3ks
2    23            1    31a
3    24            2    35a
4    25            3    34a
5    27            4    350
6    2g            8    370
7    2k            9    3gv

Example: variable “var”
- two:var and three:var are not the same.
- two:var has a length of $2; all values begin with the character “2”.
- three:var has a length of $3; all values begin with the character “3”.
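
For readers who want to reproduce the example, the two input datasets can be built with a sketch like the following. The values come from the slides. The use of DATALINES and the explicit LENGTH statements are assumptions about how the datasets were created, not something the slides state.

```sas
/* Dataset two: var is declared with length $2 */
data two;
  length id 8 var $2;
  input id var $;
  datalines;
1 21
2 23
3 24
4 25
5 27
6 2g
7 2k
;
run;

/* Dataset three: var is declared with length $3 */
data three;
  length id 8 var $3;
  input id var $;
  datalines;
0 3ks
1 31a
2 35a
3 34a
4 350
8 370
9 3gv
;
run;
```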

Example: the datastep
- data twothree;
    set two three;
  run;
- Note: the output dataset name indicates the order of the input datasets in the SET statement.

Dataset twothree

id   var
1    21
2    23
3    24
4    25
5    27
6    2g
7    2k
0    3k
1    31
2    35
3    34
4    35
8    37
9    3g

Example: data loss
- In output, twothree: id=1 has var=21 and var=31.
- In inputs, two: id=1 has var=21; three: id=1 has var=31a.
- The value of var in twothree that came from three lacks the ‘a’. Why?

Explanation:
- Variable length in the output is determined by the variable length in the first (leftmost) input dataset in the SET statement, here two:var, $2.
- 31a does not fit into a character variable of length $2, so the ‘a’ is lost through truncation.
- The LOG does not mention the data loss.
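
One way to see the lengths side by side is to query the DICTIONARY.COLUMNS view. This is a minimal sketch and not part of the original slides. It assumes two, three, and twothree all live in the WORK library. If the explanation above holds, it should report a length of 2 for var in two and twothree, and 3 in three.

```sas
/* List the stored length of var in each dataset */
proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('TWO', 'THREE', 'TWOTHREE')
      and upcase(name) = 'VAR';
quit;
```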

Example: another datastep
- What would happen if we reverse the order of the datasets in the SET statement?
- data threetwo;
    set three two;
  run;

Dataset threetwo

id   var
0    3ks
1    31a
2    35a
3    34a
4    350
8    370
9    3gv
1    21
2    23
3    24
4    25
5    27
6    2g
7    2k

Example: Matched merge
- Matched merge: data from the last dataset overwrites data from the first dataset.
- Usually, rename variables that share a name (except BY variables) to avoid overwriting.
- Sometimes you want matched values to overwrite, so you do not rename.
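
The renaming approach can be sketched with the RENAME= dataset option. The names var_two and var_three are hypothetical, chosen only for this illustration, and both inputs are assumed to be sorted by id already.

```sas
/* Keep both versions of var by renaming them on the way in */
data twothree_renamed;
  merge two  (rename=(var=var_two))
        three(rename=(var=var_three));
  by id;
run;
```

Because each renamed variable keeps the length from its own input dataset, neither value overwrites the other.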

Example: Matched merge
- data twothree;
    merge two three;
    by id;
  run;

Example: dataset twothree

id   var
0    3k
1    31
2    35
3    34
4    35
5    27
6    2g
7    2k
8    37
9    3g

Length of variable:
- Determined by the first, or leftmost, dataset in the MERGE statement.
- Again, the LOG is silent on data loss through truncation.
- What to do?

Solutions:
- Complain to SAS.
- Until SAS corrects this, use a macro before a merge or concatenation to check for unequal character variable lengths.
- If any are found, re-size the variables.
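
Re-sizing can be done in several ways. One common approach, shown here as a sketch and not as the author's method, is to declare the longer length in a LENGTH statement before the SET (or MERGE) statement, so that the declared length is used for the output variable.

```sas
/* Declare var with the longer length before reading any input */
data twothree;
  length var $3;
  set two three;
run;
```

The same LENGTH statement placed before MERGE should work for the matched merge.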

Macro: verifyVariables.sas
- Invoke before merge or concatenation.
- %verifyVariables(two,three,set);
  data twothree;
    set two three;
  run;

How does it work?
- Runs proc contents on two and three and outputs the results to datasets.
- Merges those datasets and uses a datastep to find like-named character variables.
- If they have different lengths, puts an ERROR statement into the log.
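
The steps above can be sketched roughly as follows. This is not the verifyVariables macro itself, only an illustration of the same idea under stated assumptions: the work datasets _two and _three and the variables len_two, len_three, type_two, and type_three are hypothetical names. In the PROC CONTENTS output dataset, TYPE is 2 for character variables. Names that differ only in case would need extra handling that is omitted here.

```sas
/* Write variable metadata for each input to a dataset */
proc contents data=two noprint
  out=_two(keep=name type length
           rename=(type=type_two length=len_two));
run;

proc contents data=three noprint
  out=_three(keep=name type length
             rename=(type=type_three length=len_three));
run;

/* Sort by variable name so the metadata can be match-merged */
proc sort data=_two;   by name; run;
proc sort data=_three; by name; run;

/* Flag like-named character variables whose lengths differ */
data _null_;
  merge _two(in=in_two) _three(in=in_three);
  by name;
  if in_two and in_three
     and type_two = 2 and type_three = 2
     and len_two ne len_three then
    put "ERROR: character variable " name
        "has length " len_two "in two but " len_three "in three.";
run;
```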

Macro code and slides:
- On website: http://www.unc.edu/~jcantrel/
- Macro in paper.

John Cantrell
University of North Carolina
Chapel Hill, NC 27599-7411
(919)843-6495
john_cantrell@unc.edu
