Understanding the LENGTH statement in the SAS data step

Background: SAS data structures (up through version 9.*) have always employed two data types, numeric and character. There are hints from SAS tech support that richer data types will be present whenever V10 comes along, which may make several of the following points moot. But until SAS actually offers rich data types similar to what is possible inside a SQL table structure, it’s important to understand how proper use of the LENGTH statement affects the storage characteristics of your underlying SAS data files. These comments apply to both temporary (WORK.******) files and permanent SAS databases.

Defaults: When a field is created for the first time, the SAS component that builds the Data Set Vector (or DSV; SAS documentation calls this the program data vector, or PDV) looks over the statements in the data step and determines whether the field is to be numeric or character. If no clues are present for a given field, the type defaults to numeric. You of course have complete control over these settings, so it’s always advised that you pre-define the layout with either a LENGTH or ATTRIB statement (we will focus on LENGTH in this talk).

Let’s start with a simple example to see how the system determines the setup as well as the default values.

DATA EXAMPLE1 ;
  A=1 ;
  B="X" ;
RUN ;
PROC CONTENTS DATA=EXAMPLE1 ;
RUN ;

No LENGTH or ATTRIB statement exists yet, so the system looks at the first instance of each field and sets the type and length accordingly. Variable A is assigned a number, so SAS makes it a numeric field and, unless otherwise instructed, every numeric variable receives a length of 8 bytes. Variable B will store character data, since its first occurrence assigns a value inside matching quotes, which is the syntax designating a character string within the data step.
And because that first value is only one character long, B receives a length of just 1. Partial results from the PROC CONTENTS confirm this.

Alphabetic List of Variables and Attributes

#    Variable    Type    Len
1    a           Num       8
2    b           Char      1
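The consequence of that default can be sketched as follows. This is only an illustration: the data set names EXAMPLE1B and EXAMPLE1C are made up for this sketch and are not part of the original example. In the first step B takes its length from the first assignment, so the longer value assigned afterward should be truncated; in the second step a LENGTH statement declares B wide enough before any assignment.

```sas
/* Hypothetical illustration: B gets length 1 from its first assignment, */
/* so the longer value assigned later is truncated to one character.     */
data example1b ;
  a = 1 ;
  b = "X" ;
  output ;
  b = "XYZ" ;
  output ;
run ;

/* Same logic, but B is declared with length 3 before its first use. */
data example1c ;
  length a 8 b $3 ;
  a = 1 ;
  b = "X" ;
  output ;
  b = "XYZ" ;
  output ;
run ;

/* Compare the second observation of each data set. */
proc print data=example1b ; run ;
proc print data=example1c ; run ;
```

Running PROC CONTENTS on both data sets should show B with a length of 1 in the first and 3 in the second.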
Attribute Inheritance: As subsequent data files are referenced in a script, the previously defined characteristics of each field are simply pulled into the next data set being created. The SET, MERGE or UPDATE statements cause the DSV for the new structure to take on the attributes of the previously existing fields. Most folks are probably used to modifying certain attributes as a script progresses, such as the assigned format or label. Fortunately, it’s also possible to redefine the length of a given field, and we will see an example of how to accomplish this later in this document. Before doing that, it’s best to understand the optimal settings for the character and numeric variables within any of your SAS data files.

Character Field Recommendations: It’s best to set the length of any character field to the longest string you anticipate storing in that variable. This is easily accomplished by adding the field name to a LENGTH statement, followed by a $ symbol and the length you wish to request. LENGTH B $1 ; would be the syntax for the previous example. This is where it’s important to know your data, since the number you specify after the $ should always be the largest number of columns your character field will consume. Listing anything larger than that longest string simply wastes resources, since the DSV will carve out extra bytes that only contain padded blanks. Realize that these blanks still consume space. Likewise, setting the character length to anything less than your maximum string length will result in truncated data values. On the Windows platform, a character field can be between 1 and 32,767 bytes long.

Numeric Field Recommendations: Earlier, in the PROC CONTENTS output, we saw that the default length for numeric fields is 8 bytes per variable, which happens to be the largest possible numeric length.
SAS takes this conservative approach in the event the variable contains any decimal information. It’s important that you understand your data, because if a field is a non-integer, you definitely want to leave the numeric length at 8 to maintain the highest precision possible. Technically, you don’t need to mention these numeric fields on a LENGTH statement, since the software will default the field to length 8 if not otherwise instructed (as per our earlier example). But for those numeric fields you know will always contain integer values, you’ll be making your programs more efficient and helping the Storage Area Network (SAN) by cutting down the size of your temporary and permanent data files.

On the Windows platform, SAS allows the numeric length to vary between 3 and 8 bytes. To understand these differences, the second and fourth columns below should be consulted to determine which length to use for your field while maintaining the precision within your given column.

Significant Digits and Largest Integer by Length for SAS Variables under Windows

Length in    Largest Integer              Exponential    Significant
Bytes        Represented Exactly          Notation       Digits Retained
3                                8,192    2^13            3
4                            2,097,152    2^21            6
5                          536,870,912    2^29            8
6                      137,438,953,472    2^37           11
7                   35,184,372,088,832    2^45           13
8                9,007,199,254,740,992    2^53           15
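The limits in the table can be checked with a small experiment. This is a sketch under stated assumptions: the data set names NARROW and CHECK and the variables X, ORIGINAL and SAME are hypothetical, and the step assumes the Windows limits listed above. Because the shortened length is applied when the value is written to the data set, the comparison is done in a second step that reads the stored values back.

```sas
/* Hypothetical check of the length-3 limit from the table above. */
data narrow ;
  length x 3 ;
  do x = 8191, 8192, 8193 ;
    original = x ;   /* ORIGINAL keeps the default length of 8 */
    output ;
  end ;
run ;

/* Read the stored values back and flag any that no longer match. */
data check ;
  set narrow ;
  same = (x = original) ;   /* 1 when the stored value survived intact */
run ;

proc print data=check ; run ;
```

Since 8,193 lies beyond the largest integer a 3-byte field represents exactly, SAME is expected to be 0 for that row and 1 for the other two.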
Let’s consider four scenarios that cover the majority of our work.

(1) The creation of binary or simple indicator values (0,1,2,...). The range of these values seldom gets beyond 3 digits and certainly never reaches the 8192 upper limit for a field with a defined length of 3. So any simple indicator column should always be listed on a LENGTH variable1 variable2 ... variable_n 3 ; statement within the data step creating the field. Note how each variable listed picks up the length given after the last field name.

(2) The ***SSN field(s). Because a numeric length of 5 only maintains precision to 8 positions, it cannot hold the true value of any SSN beyond 536870912. The software would begin to round these higher values, and because the precision is lost, your data loses its integrity. Thus, a length of 6 on the Windows platform is optimal for storing SSNs, since it is good through 11 positions and we only need precision through the first 9 positions.

(3) SAS date fields. Because the SAS engine stores a date internally as the number of days between January 1, 1960 and the date in your data file, pretty much every date variable can be safely stored with a LENGTH of 4. Consider that March 2010 is in the 18,300 range of days. Compared against the above table, a date 2+ million days into the future is obviously beyond anything you need to worry about.

(4) SAS date/time fields. The SAS engine also uses midnight January 1, 1960 as the origin for date/time values, but instead counts the number of seconds until the date/time stored in your data file. A quick look at the value for a March 2010 date/time shows the figure is already up to 10 positions. Thus, you’re probably safe using a LENGTH of 6, but could need 7 or 8 depending upon how far in the future (or past) your projection happens to be.
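Scenarios (3) and (4) can be sketched as follows. The data set WORK.VISITS and the variables VISIT_DATE and VISIT_DT are hypothetical names used only for illustration. The first step writes the internal values of a March 2010 date and date/time to the log so they can be compared against the table; the second declares the shorter lengths before the values are assigned.

```sas
/* Write the internal values behind a date and a date/time to the log. */
data _null_ ;
  d  = '11MAR2010'd ;             /* days since 01JAN1960 */
  dt = '11MAR2010:00:00:00'dt ;   /* seconds since midnight 01JAN1960 */
  put "Days since 01JAN1960:    " d comma12. ;
  put "Seconds since 01JAN1960: " dt comma18. ;
run ;

/* Hypothetical file using the shorter lengths discussed above. */
data work.visits ;
  length visit_date 4 visit_dt 6 ;
  format visit_date date9. visit_dt datetime20. ;
  visit_date = '11MAR2010'd ;
  visit_dt   = '11MAR2010:08:30:00'dt ;
run ;
```

One caveat on this sketch: a date/time that carries fractional seconds is a non-integer value, so under the earlier recommendation it would stay at length 8.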
As before, it’s important to know your data when making this definition.

Impact on Storage: Now that we’ve seen the recommendations, let’s take a look at a working example of how to implement these changes against a pre-existing data file. The new master patient demographic file was originally created with all 6 fields defined as numeric, each with the default length of 8. Here is a portion of the CONTENTS snapshot.

Alphabetic List of Variables and Attributes

#  Variable          Type  Len  Format  Informat  Label
6  AGE2010MAR11      Num     8                    COMPUTED AGE ON 2010MAR11
3  dob               Num     8
2  dob_assign_level  Num     8                    1: source priority, 2: concensus, 3: priority w inconsistency
1  scrssn            Num     8  SSN11.  BEST22.   SCRAMBLED SOCIAL SECURITY NUMBER
5  sex               Num     8
4  sex_assign_level  Num     8                    1: source priority, 2: concensus, 3: priority w inconsistency
Even without knowing much about this data, one can easily assume that the AGE related field will be a 3 digit value at the very most. Since a length of 3 will house values up to 8192, this is the obvious choice for this field. The DOB field is short for Date Of Birth and thus a SAS date column. Our earlier discussion on SAS dates means we should use a length of 4 on this field. The labels on DOB_ASSIGN_LEVEL and SEX_ASSIGN_LEVEL provide a clue that each field holds one of at most 3 small integer values. Again, a SAS length of 3 is the obvious choice. The SCRSSN field is the scrambled SSN of the patient. Our earlier discussion on SSNs means we should use a length of 6 here. Finally, the SEX field is not immediately obvious. But since we know there are very few possible choices for the patient’s gender, chances are high this is going to be a simple (small) integer list. A quick ad-hoc frequency distribution confirms as much, so a length of 3 is again in order here.

To get these newly defined lengths into a data structure, we simply define them on a LENGTH statement placed before the SET statement in our example build, as per the following. Because LENGTH is listed first, the DSV is laid out by moving sequentially from the top to the bottom of the data step and defining each field’s characteristics per its first mention. Then, as the DSV starts to populate with data records (per the execution of the SET statement), the data simply flows into the new structure and is output to the file designated on the DATA statement. If the SET statement were to precede the LENGTH statement, the original field characteristics would be defined into the DSV first; for character variables, the later LENGTH statement would then have no effect. (SAS documents that numeric lengths can still be changed by a later LENGTH statement, but listing LENGTH first works for both types.)

libname test 'f:\sas_stuff\data\consistent_variables';

data test.example;
  /* declare the shorter lengths before SET brings in the old attributes */
  length scrssn 6 dob 4 dob_assign_level sex_assign_level sex age2010mar11 3;
  set test.fy10dob_sex2010mar11;
run;
Note that I’ve changed the order of the fields slightly, but position isn’t of major importance in the SAS environment since fields are referenced explicitly by variable name. If you truly want to maintain the order, list the fields on the LENGTH statement in the same order in which they were defined in the original file.

After running the above steps to create my smaller TEST.EXAMPLE file, I also ran a comparison of the original file against the new structure with the following code:

proc compare base=test.fy10dob_sex2010mar11 compare=test.example ;
run;
The following results of that procedure show that none of the roughly 5.4 million records have unequal values, meaning that the precision of the data was maintained by the smaller numeric lengths.

The COMPARE Procedure
Comparison of TEST.FY10DOB_SEX2010MAR11 with TEST.EXAMPLE
(Method=EXACT)

Data Set Summary

Dataset                     Created            Modified           NVar  NObs
TEST.FY10DOB_SEX2010MAR11   11MAR10:09:21:53   11MAR10:09:21:53      6  5385921
TEST.EXAMPLE                12MAR10:10:55:53   12MAR10:10:55:53      6  5385921

Variables Summary

Number of Variables in Common: 6.
Number of Variables with Differing Attributes: 6.

Listing of Common Variables with Differing Attributes

Variable          Dataset                     Type  Length  Format  Informat
scrssn            TEST.FY10DOB_SEX2010MAR11   Num        8  SSN11.  BEST22.
                  TEST.EXAMPLE                Num        6  SSN11.  BEST22.
dob_assign_level  TEST.FY10DOB_SEX2010MAR11   Num        8
                  TEST.EXAMPLE                Num        3
dob               TEST.FY10DOB_SEX2010MAR11   Num        8
                  TEST.EXAMPLE                Num        4
sex_assign_level  TEST.FY10DOB_SEX2010MAR11   Num        8
                  TEST.EXAMPLE                Num        3
sex               TEST.FY10DOB_SEX2010MAR11   Num        8
                  TEST.EXAMPLE                Num        3
AGE2010MAR11      TEST.FY10DOB_SEX2010MAR11   Num        8
                  TEST.EXAMPLE                Num        3

Observation Summary

Observation      Base  Compare
First Obs           1        1
Last Obs      5385921  5385921

Number of Observations in Common: 5385921.
Total Number of Observations Read from TEST.FY10DOB_SEX2010MAR11: 5385921.
Total Number of Observations Read from TEST.EXAMPLE: 5385921.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5385921.
NOTE: No unequal values were found. All values compared are exactly equal.
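Beyond confirming equal values, the per-record footprint of the two files can also be compared from within SAS. The query below is a sketch that assumes the TEST libname from the example is still assigned. It reads the DICTIONARY.TABLES view in PROC SQL; the OBSLEN and NOBS columns should be present, while the FILESIZE column is an assumption here, since its availability may depend on your SAS release.

```sas
/* Sketch: compare observation length and file size of both files.  */
/* FILESIZE is assumed to exist in this release; drop it if it does */
/* not appear in your DICTIONARY.TABLES view.                       */
proc sql ;
  select memname, nobs, obslen, filesize
    from dictionary.tables
    where libname = 'TEST'
      and memname in ('FY10DOB_SEX2010MAR11', 'EXAMPLE') ;
quit ;
```

With six numeric fields at length 8 versus lengths of 6, 4 and four fields at 3, the observation length would be expected to drop from 48 bytes to 22 bytes.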
When looking at the source folder with Windows Explorer, it’s easy to see that we went from a 256 MB file down to a 128 MB file, a savings of 50%!

Conclusions: This example focused on two permanent data files stored in the same folder so the impact could easily be displayed to the audience. The same concept applies whether your data file exists in the temporary WORK environment or in one of your permanent data folders. WORK files that employ these approaches and are built efficiently will allow the SAS servers to run quicker. Since most of your final permanent output files are built from combinations of your temporary tables, making these optimal definitions at the start will have a positive effect as they filter into all your subsequent sets.

Extra: If you want to retrofit previously created data in a programmatic fashion, go to the SAS support site and search for the %SQUEEZE macro. Download the file and place it into the macro auto-call folder on your system (e.g. C:\Program Files\SAS\SAS V9.*\Core\SASmacro\ is the likely location). The internal documentation on this macro is good. The main thing to be aware of is that you can’t overwrite a file directly. That is, the location specified in argument 1 of the macro call MUST be different from the location in argument 2. All the other settings are optional depending on your particular needs.
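When choosing lengths by hand rather than through a macro, a quick range and integer check helps confirm that the fields really fit the lengths discussed above. The following is only a sketch against the example file; the helper variables I and NON_INTEGER are hypothetical names introduced here.

```sas
/* Range check: compare MIN and MAX against the largest-integer table. */
proc means data=test.fy10dob_sex2010mar11 n nmiss min max ;
  var scrssn dob dob_assign_level sex_assign_level sex age2010mar11 ;
run ;

/* Integer check: count numeric values that carry a fractional part. */
data _null_ ;
  set test.fy10dob_sex2010mar11 end=last ;
  array nums {*} _numeric_ ;   /* numeric fields read from the file */
  do i = 1 to dim(nums) ;
    if not missing(nums{i}) and nums{i} ne int(nums{i}) then non_integer + 1 ;
  end ;
  if last then put "Non-integer values found: " non_integer ;
run ;
```

Any field that reports non-integer values, or a maximum beyond the limit for the intended length, is best left at length 8.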