Preventing data loss when combining SAS datasets. John Cantrell Univ. North Carolina Chapel Hill, NC SAS has powerful tools for dataset combination Proc SQL SAS data step SAS data step Datastep methods use SET, MERGE, and BY statements MERGE SET statement statement no BY statement 1 to 1 merge followed with BY matched statement merge concatenation interleaving The workhorses: Matched merge & concatenation. Data twothree; merge two three; run; Forgot BY statement, yields 1 to 1 merge instead of matched merge LOG: errors, warnings & notes Unintentional 1 to 1 merge caught with “options mergenoby=error;” Error messages in the LOG help to find mistakes Not all problems yield error messages: the note, “repeats of BY values” > many-to-many merge LOG does not flag all mistakes A serious problem can occur without any message in the LOG For example, data loss by truncation due to unequal variable lengths Example: Concatenation Concatenation of 2 datasets named two and three, each input dataset contains: A numeric variable named id. A character variable named var. Example: input datasets: dataset two id var dataset three id var 1 2 3 4 5 6 7 0 1 2 3 4 8 9 21 23 24 25 27 2g 2k 3ks 31a 35a 34a 350 370 3gv Example: variable “var” two:var and three:var are not the same two:var has a length of $2, all values begin with the character “2” three:var has a length of $3, all values begin with the character “3” Example: the datastep data twothree; set two three; run; Note: the output dataset name indicates the order of input datasets in SET statement. Dataset twothree id 1 2 3 4 5 6 7 var 21 23 24 25 27 2g 2k id 0 1 2 3 4 8 9 var 3k 31 35 34 35 37 3g Example: data loss In output: twothree:id=1 var =21 and var=31 In inputs: two:id=1 var=21 three:id=1 var=31a Value of var in twothree that came from three lacks the ‘a’, Why? Explanation: Variable length in output is determined by variable length in the first (leftmost) input dataset in the SET statement, two:var $2. 31a does not fit into a character variable of length $2, so the ‘a’ is lost through truncation. LOG does not mention data loss. Example: another datastep What would happen if we reverse the order of the datasets in the SET statement? data threetwo; set three two; run; Dataset threetwo id 0 1 2 3 4 8 9 var 3ks 31a 35a 34a 350 370 3gv id 1 2 3 4 5 6 7 var 21 23 24 25 27 2g 2k Example: Matched merge Matched merge: data from last dataset overwrites data from first dataset. Usually re-name variables with same name to avoid overwriting (except BY variables). Maybe you want matched values to overwrite, so you do not rename. Example: Matched merge data twothree; merge two three; by id; run; Example: dataset twothree id var id var 0 3k 5 27 1 31 6 2g 2 35 7 2k 3 34 8 37 4 35 9 3g Length of variable: Determined by first, or left most, dataset in MERGE statement. Again, the LOG is silent on data loss through truncation. What to do? Solutions: Complain to SAS Until SAS corrects, use macro before merge or concatenation to check for unequal character variable lengths If found, re-size variables Macro: verifyVariables.sas Invoke before merge or concatenation. %verifyVariables(two,three,set); data twothree; set two three; run; How does it work? proc contents on two and three, outputs results to datasets Merges datasets and uses datastep to find like named character variables It they have different lengths, puts ERROR statement into log. Macro code and slides: On website: http://www.unc.edu/~jcantrel/ Macro in paper. John Cantrell University of North Carolina Chapel Hill, NC 27599-7411 (919)843-6495 john_cantrell@unc.edu