How to Reduce the Disk Space Required by a SAS® Data Set

NESUG 2006
Ins & Outs
Selvaratnam Sridharma, U.S. Census Bureau, Washington, DC
ABSTRACT
SAS data sets can be large, and disk space is often at a premium. This paper discusses how to reduce the size
of a SAS data set using the COMPRESS= option, the LENGTH and ATTRIB statements, SAS views, and macros.
The existing %SQUEEZE macro finds the minimum lengths required for both numeric and character variables in a
SAS data set and uses these minimum lengths to reduce the size of the data set. A second macro, %DROPMISS,
developed here, automatically identifies and drops SAS variables that have only missing or null values.
INTRODUCTION
Storing a large data set can exhaust the available storage space. Compressing the data set with the SAS
COMPRESS= option can save a large amount of storage space, as can using SAS views in place of SAS data sets.
Another way to reduce the size of a SAS data set is to save only the needed variables, using the DROP= and/or
KEEP= options. Using LENGTH or ATTRIB statements to assign the minimum lengths that the variables require
also reduces the size of a SAS data set, but it is often difficult to find the minimum length a variable
requires. The existing %SQUEEZE macro finds the minimum lengths required by the variables in a SAS data set
and assigns these lengths to the variables. The %SQUEEZE macro is modified slightly here to make it more
efficient. Sometimes all the values of some variables in a SAS data set are missing, and we would like to drop
these variables to save storage space. The %DROPMISS macro developed here does this.
SAS SYSTEM OPTIONS / STATEMENTS
Some SAS system options / statements such as LENGTH, ATTRIB, KEEP, DROP, and COMPRESS can be used
to reduce the size required by a SAS data set.
LENGTH AND ATTRIB
Controlling the lengths of individual variables may greatly reduce the size of a SAS data set. A LENGTH or
ATTRIB statement can be used to assign a length to a numeric or a character variable. For character variables,
the statement must occur in the DATA step before the first reference to the variables it names. For
integer-valued numeric variables and for character variables with short values, this may dramatically decrease
the size of the data set. For character variables, one byte corresponds to one character. Hence, to minimize
storage space, set the length of each character variable to the number of characters in the longest value of
the variable. The minimum length required by a numeric variable depends on the operating environment. Two
examples are given below.
Data X;
   Length a b c 3
          d e 5
          f g $4;
   set Y;
run;
Data X;
   Attrib a b c length=3;
   Attrib d e length=5;
   Attrib f g length=$4;
   set Y;
run;
The ATTRIB statement can also be used to change a variable's FORMAT, INFORMAT, and LABEL.
One needs to be careful when assigning the length of a numeric variable using the LENGTH statement. If the
length assigned to a numeric variable is not adequate, some of the values of that variable will be truncated in
the output data set, and the statement will not generate an error. It is not advisable to shorten non-integer
variables because you can lose the precision of some of the non-integer values. When the length assigned to a
character variable is not adequate, its values are truncated as well; depending on the SAS release and the
setting of the VARLENCHK= system option, SAS may write a warning or an error to the log.
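The risk of numeric truncation can be checked before a shorter length is assigned. The following is a minimal sketch rather than code from this paper: it uses the TRUNC function (the same function %SQUEEZE relies on) to compare a few values with what would be stored in 3 bytes. The variable names and the test values are assumptions chosen for illustration.

```sas
/* Illustrative check: does a value survive storage at LENGTH 3?       */
/* TRUNC( value, 3 ) returns the value as it would be stored in        */
/* 3 bytes. The test values below are hypothetical.                    */
data _null_ ;
   do value = 100, 8192, 8193, 0.1 ;
      stored = trunc( value, 3 ) ;
      exact  = ( stored = value ) ;   /* 1 = nothing lost, 0 = value altered */
      put value= stored= exact= ;
   end ;
run ;
```

Any value for which EXACT is 0 would be altered by the shorter length, so the variable needs more bytes. The outcome depends on the host, so the check is best run in the operating environment where the data set will be stored.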
KEEP AND DROP
When a SAS data set is created, only the needed variables should be kept. This could save a large amount of
space required to store the dataset. This can be done using KEEP= and/or DROP= to delete the unnecessary
variables. To save processing time, this should be done as early as logically possible as in the following example.
Data A (keep= a b q r);
   Set B (drop = h k);
   a= l+p;
   b= r+q;
Run;
SAS COMPRESS
The COMPRESS= option is available both as a SAS system option and as a data set option, and it can greatly
reduce the disk space required to store a SAS data set. You can set the option to either YES or BINARY. In
newer versions, CHAR can be used instead of YES. If there are more character variables than numeric
variables, COMPRESS=YES generally works better. If there are more numeric variables than character variables,
COMPRESS=BINARY generally works better. Both settings should be tried to find out which one works better for
a given data set. The options are used as in the following examples.
Data A (COMPRESS= YES);
   SET SASHELP.EISMSG;
RUN;
NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
NOTE: The data set WORK.A has 1470 observations and 6 variables.
NOTE: Compressing data set WORK.A decreased size by 57.14 percent.
Compressed is 15 pages; un-compressed would require 35 pages.
Data A (COMPRESS= BINARY);
   SET SASHELP.EISMSG;
RUN;
NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
NOTE: The data set WORK.A has 1470 observations and 6 variables.
NOTE: Compressing data set WORK.A decreased size by 54.29 percent.
Compressed is 16 pages; un-compressed would require 35 pages.
When COMPRESS= is set as a SAS system option at the beginning of a program, all the SAS data sets created by
the program are compressed. The REUSE= option can be used together with COMPRESS=. Specifying REUSE=YES
allows SAS to reuse space within the compressed SAS data set that has been freed by deleted observations.
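As a minimal sketch of the two ways of requesting compression, the following sets COMPRESS= and REUSE= as system options and then overrides the compression type for a single data set. The output data set names are hypothetical, and SASHELP.CLASS is used only because it ships with SAS; it is small enough that compression may not pay off, which the log will report.

```sas
/* System options: every data set created from here on is compressed, */
/* and space freed by deleted observations may be reused.             */
options compress=yes reuse=yes ;

data work.class_char ;                       /* uses the system option   */
   set sashelp.class ;
run ;

data work.class_bin ( compress=binary ) ;    /* data set option overrides */
   set sashelp.class ;
run ;

/* Switch both options back off for the rest of the session */
options compress=no reuse=no ;
```

Setting the option on the data set rather than for the whole session keeps the CPU cost of compression confined to the data sets that actually benefit from it.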
All compressed SAS data sets are uncompressed by SAS prior to being used in computations in the DATA or
PROC steps. So, although compression saves disk space, it requires additional CPU time to compress and
uncompress.
Sometimes, compressing will result in a file larger than the uncompressed file if the uncompressed file is small.
Beginning with version 8, SAS will not compress a SAS data set when the result would be a larger file.
Data A (COMPRESS= YES);
   SET SASHELP.ACCPEO;
RUN;
NOTE: There were 20 observations read from the data set SASHELP.ACCPEO.
NOTE: The data set WORK.A has 20 observations and 3 variables.
NOTE: Compressing data set WORK.A increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
Here are some benchmark results for a large data set that is used at Census Bureau. The compression ratio is
the ratio of the size of the compressed data set to the size of the uncompressed data set.
COMPRESS OPTION     SIZE (BYTES)     COMPRESSION RATIO
None                80,159,297       -----
Binary              21,517,441       26.8%
Char                37,008,961       46.1%
SAS VIEW
As an alternative to a SAS data set, one can use a SAS view. A SAS view can be read by DATA and PROC steps
just like a SAS data set. A view contains only the instructions that are required for retrieving data values
from other SAS data sets or files, so it occupies only a small fraction of the space required by the
corresponding SAS data set. A SAS view can be created with a DATA step or with PROC SQL. The following is an
example of a SAS view created with a DATA step.
Data B /view = B;
   set sashelp.eismsg;
run;
NOTE: DATA STEP view saved on file WORK.B.
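A view is read like any other data set. As a small illustration (the view name and the derived variable are hypothetical and do not come from this paper), the following defines a DATA step view over SASHELP.CLASS and passes it to a procedure. The observations are generated when PROC MEANS runs, not when the view is defined.

```sas
/* Define a DATA step view: only the instructions are stored */
data work.class_v / view=work.class_v ;
   set sashelp.class ;
   height_cm = height * 2.54 ;     /* derived at read time, never stored */
run ;

/* Use the view wherever a data set name is expected */
proc means data=work.class_v n mean ;
   var height_cm ;
run ;
```

The trade-off is that the underlying data are re-read and the derivation is re-run each time the view is used, so a view saves disk space at the cost of processing time.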
A PROC SQL View can read data from DATA step Views, SAS data sets, other PROC SQL views, ORACLE or
other DBMS data.
Proc sql;
   Create view AB as
      select var1, var2, var3
      from A
      order by var3, var4;
NOTE: SQL view WORK.AB has been defined.
quit;
In the above example A can be a SAS view, SAS data set, ORACLE table or any other DBMS table.
Starting with Version 8, a DATA step view retains its source statements. One can retrieve these statements as
in the following example.
data view=B;
   describe;
run;
NOTE: DATA step view WORK.B is defined as:
data B/view=B;
set sashelp.eismsg;
run;
To retrieve the source statements for an SQL View, one needs to use SQL as in the following example.
proc sql;
   describe view AB;
NOTE: SQL view WORK.AB is defined as:
        select var1, var2, var3
          from A
      order by var3 asc, var4 asc;
quit;
SQUEEZING A SAS DATA SET
When a large data set is created, it is often difficult to find the minimum length required by an individual
variable. The following macros find the minimum lengths required by the numeric and character variables of a
SAS data set and use these lengths to reduce the size of the data set. These macros can greatly reduce the
storage space required by a SAS data set.
%SQUEEZE
The %SQUEEZE macro, created by Ross Bettinger (see Reference 1), squeezes a data set by reducing the space
required by numeric and character variables. If you do not want to squeeze some variables, you can exclude
them by using a parameter of the %SQUEEZE macro. Omitting the highlighted part of the code in Appendix A
gives the code for the %SQUEEZE macro.
%SQUEEZE_1
The %SQUEEZE macro is modified slightly here to create the macro %SQUEEZE_1. The %SQUEEZE macro checks all
numeric variables to find the minimum lengths they require by repeatedly applying the TRUNC function to every
value of these variables. There is no need to find the minimum length of a numeric variable whose length is
already three, and it is not always necessary to apply the TRUNC function to every value of a numeric
variable: a value small enough in magnitude to fit in three bytes can be skipped. %SQUEEZE_1 incorporates
these improvements. This macro runs at least as fast as %SQUEEZE and generally runs faster. The squeezing
technique discussed here may be used on integer-valued numeric variables, but should not be used on
non-integer-valued numeric variables, because some accuracy might be lost for these variables.
The code for this macro is given in Appendix A.
%SQUEEZE_1     SIZE (bytes)     SQUEEZED RATIO
No             24,969,537       -----
Yes            21,515,265       86.1%
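A call to %SQUEEZE_1 might look like the following sketch. It assumes that the macro definition from Appendix A has already been submitted in the session; the data set names WORK.BIG and WORK.BIG_SQZ and the variable names PRICE and RATE are hypothetical. The non-integer numeric variables are listed in NOCOMPRESS= so that they keep their original lengths.

```sas
/* Squeeze every variable except the two non-integer ones */
%SQUEEZE_1( work.big, work.big_sqz, NOCOMPRESS=price rate )

/* Compare the variable lengths before and after squeezing */
proc contents data=work.big     varnum ; run ;
proc contents data=work.big_sqz varnum ; run ;
```

The input and output names must differ, because the macro stops with an error message when they are the same.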
COMPARING %SQUEEZE AND %SQUEEZE_1
%SQUEEZE and %SQUEEZE_1 were each used to squeeze a large SAS data set of 24,969,537 bytes, with the results
below.
Macros          TIME (minutes)
%SQUEEZE        126
%SQUEEZE_1      108
DROPPING VARIABLES WITH ONLY MISSING VALUES
When all the values of some numeric or character variables are missing, deleting these variables can save a
large amount of disk space. Sometimes, however, you want to keep certain variables even though all their
values are missing. The %DROPMISS macro developed here (see Appendix B) automatically drops the variables in
a data set that have only missing values. A parameter of the %DROPMISS macro lets you name the variables
that must not be dropped. This macro is more efficient than the program in Reference 2. The table below
gives the results for both programs for a SAS data set.
Programs                  SIZE (bytes)     TIME (minutes)
Program in Sample 53      567,889          13.2
%DROPMISS                 567,889          3.3
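A call to %DROPMISS might look like the following sketch, which assumes that the macro definition from Appendix B has already been submitted. The data set names and the variable name COMMENT_TXT are hypothetical.

```sas
/* Drop every all-missing variable except COMMENT_TXT, which is kept */
/* because a later processing step expects it to exist.              */
%DROPMISS( work.survey, work.survey_trim, NODROP=comment_txt )

/* Confirm which variables remain */
proc contents data=work.survey_trim short ; run ;
```

As with %SQUEEZE_1, the output data set must have a different name from the input data set.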
COMBINING %SQUEEZE_1 AND %DROPMISS
Combining %SQUEEZE_1 and %DROPMISS gives another macro, %SQ_DROPMISS (see Appendix C). To save processing
time, it is better to use %SQ_DROPMISS than to apply %SQUEEZE_1 and %DROPMISS separately to a SAS data set.
When the methods in %SQUEEZE are applied to a SAS data set, they squeeze the lengths of the character
variables that are always missing to 1 and the lengths of the numeric variables that are always missing to 3.
So, after applying the methods in the %SQUEEZE macro to a SAS data set, only the character variables with
length 1 and the numeric variables with length 3 need to be checked to find the variables that have only
missing values. This saves a great amount of processing time.
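The short list of candidates described above can be sketched with a query against DICTIONARY.COLUMNS. This illustrates the idea and is not the code used in %SQ_DROPMISS, which reads PROC CONTENTS output instead; it assumes a squeezed data set named WORK.DSNSQUEEZED.

```sas
/* List the only variables that can still be all-missing after squeezing: */
/* numeric variables of length 3 and character variables of length 1.     */
proc sql ;
   select name, type, length
      from dictionary.columns
      where libname = 'WORK'
        and memname = 'DSNSQUEEZED'
        and memtype = 'DATA'
        and (   ( type = 'num'  and length = 3 )
             or ( type = 'char' and length = 1 ) ) ;
quit ;
```

Only the variables this query returns need to have their values scanned for missing data, which is where the saving in processing time comes from.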
CONCLUSIONS
There are many ways to reduce the size of a data set that you want to store. Some SAS options and SAS
statements, SAS Views, and some macros discussed in this paper can be used to reduce the space required by a
SAS data set. Instead of using %SQUEEZE_1 and %DROPMISS for a data set, it would be better to use the
macro %SQ_DROPMISS to save processing time.
REFERENCES
1. Ross Bettinger, Sample 267: %SQUEEZE-ing before Compressing Data, Redux.
7 Jul. 2006 <http://support.sas.com/ctx/samples/index.jsp?sid=267>
2. Sample 53: Delete variables that have only missing data.
7 Jul. 2006 <http://support.sas.com/ctx/samples/index.jsp?sid=53&tab=code>
ACKNOWLEDGMENTS
We would like to thank David Chapman for offering valuable suggestions and comments.
SAS is a Registered Trademark of the SAS Institute, Inc. of Cary, North Carolina.
DISCLAIMER
This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone
a more limited review by the Census Bureau than its official publications. This report is released to inform
interested parties and to encourage discussion.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Selvaratnam Sridharma
Economic Planning and Coordination Division
U.S. Bureau of the Census
Washington, DC 20233-6100
301-763-6774
Email: selvaratnam.sridharma@census.gov
APPENDIX A:
%SQUEEZE_1
%macro SQUEEZE_1( DSNIN         /* name of input SAS dataset                       */
                , DSNOUT        /* name of output SAS dataset                      */
                , NOCOMPRESS=   /* [optional] variables to be omitted from the
                                   minimum-length computation process              */
                ) ;

/* PURPOSE: create LENGTH statement for vars that minimizes the variable length
 * to:
 *    numeric vars:   the fewest # of bytes needed to exactly represent the values
 *                    contained in the variable
 *    character vars: the fewest # of bytes needed to contain the longest
 *                    character string
 *
 * macro variables SQZ_NUM, SQZ_CHAR and SQZ_CHAR_FMT are created and then used
 * in a subsequent data step
 *
 * NOTE: if no char vars in dataset, produce no char var processing code
 * NOTE: length of format for char vars is changed to match computed length
 *       of char var
 *       e.g., if length( CHAR_VAR ) = 10 after %SQUEEZE-ing, then FORMAT CHAR_VAR
 *       $10. ; is generated
 * NOTE: variables in &DSNOUT are maintained in same order as in &DSNIN
 * NOTE: variables named in &NOCOMPRESS are not included in the minimum-length
 *       computation process and keep their original lengths as specified in
 *       &DSNIN
 *
 * EXAMPLE OF USE:
 *    %SQUEEZE_1( DSNIN, DSNOUT )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=A B C D--H X1-X100 )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=_numeric_ )
 *    %SQUEEZE_1( DSNIN, DSNOUT, NOCOMPRESS=_character_ )
 */
%global SQUEEZE ;
%local I ;
%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put | ERROR from SQUEEZE:                            | ;
   %put | Input Dataset has same name as Output Dataset. | ;
   %put | Execution terminating forthwith.               | ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;
/*###############################################################################*/
/* begin executable code
/*###############################################################################*/
/*===============================================================================*/
/* Find the first positive integer n such that n+1 needs more than 3 bytes
/* Negative of this number will be the first negative integer n such that n-1
/* needs more than 3 bytes
/*===============================================================================*/
data x;
   do i=1 to 10000;
      a=trunc(i,3);
      if a ^=i then do;
         call symput ('max_3' , a);
         output;
         stop;
      end;
   end;
run;
/*===============================================================================*/
/* create dataset of variable names whose lengths are to be minimized
/* exclude from the computation all names in &NOCOMPRESS
/*===============================================================================*/
proc contents data=&DSNIN( drop=&NOCOMPRESS ) memtype=data noprint
out=_cntnts_( keep= name type LENGTH) ; run ;
%let N_CHAR = 0 ;
%let N_NUM = 0 ;
data _null_ ;
set _cntnts_ end=lastobs nobs=nobs ;
WHERE (TYPE =1 AND LENGTH ^= 3) OR (TYPE =2 AND LENGTH ^=1);
if nobs = 0 then stop ;
n_char + ( type = 2 ) ;
n_num + ( type = 1 ) ;
/* create macro vars containing final # of char, numeric variables */
if lastobs then do ;
call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
call symput( 'N_NUM' , left( put( n_num , 5. ))) ;
end ;
run ;
/*===============================================================================*/
/* if there are NO numeric or character vars in dataset, stop further
/* processing
/*===============================================================================*/
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put | ERROR from SQUEEZE:              | ;
   %put | No variables in dataset.         | ;
   %put | Execution terminating forthwith. | ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;
/*===============================================================================*/
/* put global macro names into global symbol table for later retrieval
/*===============================================================================*/
%do I = 1 %to &N_NUM ;
%global NUM&I NUMLEN&I ;
%end ;
%do I = 1 %to &N_CHAR ;
%global CHAR&I CHARLEN&I ;
%end ;
/*===============================================================================*/
/* create macro vars containing variable names
/* efficiency note: could compute n_char, n_num here, but must declare macro
/* names to be global b4 stuffing them
/* note: if no char vars in data, do not create macro vars
/*===============================================================================*/
proc sql noprint ;
%if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from
_cntnts_ where type = 2 AND LENGTH NE 1; ) ;
%if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM
from
_cntnts_ where type = 1 AND LENGTH NE 3; ) ;
quit ;
/*===============================================================================*/
/* compute min # bytes (3 = min length, for portability over platforms) for
/* numeric vars compute min # bytes to keep rightmost character for char vars
/*===============================================================================*/
data _null_ ;
set &DSNIN end=lastobs ;
%if &N_NUM > 0 %then %str ( array _num_len_ ( &N_NUM ) 3 _temporary_ ; ) ;
%if &N_CHAR > 0 %then %str( array _char_len_ ( &N_CHAR ) _temporary_ ; ) ;
if _n_ = 1 then do;
%if &N_CHAR > 0 %then %str( do i = 1 to &N_CHAR ; _char_len_( i ) = 0 ;
end ; ) ;
%if &N_NUM > 0 %then %str( do i = 1 to &N_NUM ; _num_len_ ( i ) = 3 ;
end ; ) ;
end ;
%if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
_char_len_( &I ) = max( _char_len_( &I ), length( &&CHAR&I )) ;
%end ;
%if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
   if &&NUM&I ne . then do ;
      if ( &&NUM&I > &max_3 or &&NUM&I < -&max_3 ) then do ;
         if &&NUM&I ne trunc( &&NUM&I, 7 ) then _num_len_( &I ) = max( _num_len_( &I ), 8 ) ; else
         if &&NUM&I ne trunc( &&NUM&I, 6 ) then _num_len_( &I ) = max( _num_len_( &I ), 7 ) ; else
         if &&NUM&I ne trunc( &&NUM&I, 5 ) then _num_len_( &I ) = max( _num_len_( &I ), 6 ) ; else
         if &&NUM&I ne trunc( &&NUM&I, 4 ) then _num_len_( &I ) = max( _num_len_( &I ), 5 ) ; else
         if &&NUM&I ne trunc( &&NUM&I, 3 ) then _num_len_( &I ) = max( _num_len_( &I ), 4 ) ;
      end ;
   end ;
%end ;
if lastobs then do ;
%if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
call symput( "CHARLEN&I", put( _char_len_( &I ), 5. )) ;
%end ;
%if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
call symput( "NUMLEN&I", put( _num_len_( &I ), 1. )) ;
%end ;
end ;
run ;
proc datasets nolist ; delete _cntnts_ ; run ;
/*===============================================================================*/
/* initialize SQZ_NUM, SQZ_CHAR global macro vars
/*===============================================================================*/
%let SQZ_NUM      = LENGTH ;
%let SQZ_CHAR     = LENGTH ;
%let SQZ_CHAR_FMT = FORMAT ;
%if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
   %let SQZ_CHAR     = &SQZ_CHAR %qtrim( &&CHAR&I ) $%left( &&CHARLEN&I ) ;
   %let SQZ_CHAR_FMT = &SQZ_CHAR_FMT %qtrim( &&CHAR&I ) $%left( &&CHARLEN&I ). ;
%end ;
%if &N_NUM > 0 %then %do I = 1 %to &N_NUM ;
   %let SQZ_NUM = &SQZ_NUM %qtrim( &&NUM&I ) &&NUMLEN&I ;
%end ;
/*===============================================================================*/
/* build macro var containing order of all variables
/*===============================================================================*/
data _null_ ;
length retain $32767 ;
retain retain 'retain ' ;
dsid = open( "&DSNIN", 'I' ) ; /* open dataset for read access only */
do _i_ = 1 to attrn( dsid, 'nvars' ) ;
retain = trim( retain ) || ' ' || varname( dsid, _i_ ) ;
end ;
call symput( 'RETAIN', retain ) ;
run ;
/*===============================================================================*/
/* apply SQZ_* to incoming data, create output dataset
/*===============================================================================*/
data &DSNOUT ;
&RETAIN ;
%if &N_CHAR > 0 %then %str( &SQZ_CHAR ; ); /* optimize char var lengths
*/
%if &N_NUM > 0 %then %str( &SQZ_NUM ; ); /* optimize numeric var lengths */
%if &N_CHAR > 0 %then %str( &SQZ_CHAR_FMT ; ) ; /* adjust char var format
lengths */
set &DSNIN ;
run ;
%L9999:
%mend SQUEEZE_1 ;
APPENDIX B:
%DROPMISS
%macro DROPMISS( DSNIN       /* name of input SAS dataset                         */
               , DSNOUT      /* name of output SAS dataset                        */
               , NODROP=     /* [optional] variables to be omitted from dropping
                                even if they have only missing values             */
               ) ;

/* PURPOSE: To find both Character and Numeric variables that have only
 *          missing values and drop them if they are not in &NODROP
 *
 * NOTE: if no char vars in dataset, produce no char var processing code
 *
 * EXAMPLE OF USE:
 *    %DROPMISS( DSNIN, DSNOUT )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=A B C D--H X1-X100 )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=_numeric_ )
 *    %DROPMISS( DSNIN, DSNOUT, NODROP=_character_ )
 */
%global DROP1 ;
%local I ;
%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put | ERROR from DROPMISS:                           | ;
   %put | Input Dataset has same name as Output Dataset. | ;
   %put | Execution terminating forthwith.               | ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;
/*###############################################################################*/
/* begin executable code
/*###############################################################################*/
/*===============================================================================*/
/* create dataset of variable names that have only missing values
/* exclude from the computation all names in &NODROP
/*===============================================================================*/
proc contents data=&DSNIN( drop=&NODROP ) memtype=data noprint out=
_cntnts_( keep= name type ) ; run ;
%let N_CHAR = 0 ;
%let N_NUM = 0 ;
data _null_ ;
set _cntnts_ end=lastobs nobs=nobs ;
if nobs = 0 then stop ;
n_char + ( type = 2 ) ;
n_num + ( type = 1 ) ;
/* create macro vars containing final # of char, numeric variables */
if lastobs then do ;
call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
call symput( 'N_NUM' , left( put( n_num , 5. ))) ;
end ;
run ;
/*===============================================================================*/
/* if there are NO numeric or character vars in dataset, stop further */
/*===============================================================================*/
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put | ERROR from DROPMISS:             | ;
   %put | No variables in dataset.         | ;
   %put | Execution terminating forthwith. | ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;
/*===============================================================================*/
/* put global macro names into global symbol table for later retrieval
*/
/*===============================================================================*/
%do I = 1 %to &N_NUM ;
%global NUM&I ;
%end ;
%do I = 1 %to &N_CHAR ;
%global CHAR&I ;
%end ;
/*===============================================================================*/
/* create macro vars containing variable names
/* efficiency note: could compute n_char, n_num here, but must declare macro
/* names to be global b4 stuffing them
/* note: if no char vars in data, do not create macro vars
/*===============================================================================*/
proc sql noprint ;
%if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from
_cntnts_ where type = 2 ; ) ;
%if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM from
_cntnts_ where type = 1 ; ) ;
quit ;
/*===============================================================================*/
/* put MAXIMUM values of the variables into macro variables
/*===============================================================================*/
%IF &N_CHAR > 1 %THEN
%let N_CHAR_1 = %EVAL(&N_CHAR - 1);
%IF &N_NUM > 1 %THEN
%let N_NUM_1 = %EVAL(&N_NUM - 1);
Proc sql ;
   select
   %IF &N_NUM > 1 %THEN %DO;
      %do I= 1 %to &N_NUM_1; max(&&NUM&I), %END;
   %END;
   %IF &N_NUM > 0 %THEN %DO; MAX(&&NUM&N_NUM) %END;
   %IF &N_CHAR > 0 AND &N_NUM > 0 %THEN %DO; , %END;
   %IF &N_CHAR > 1 %THEN %DO;
      %do I= 1 %to &N_CHAR_1; max(&&CHAR&I), %END;
   %END;
   %IF &N_CHAR > 0 %THEN %DO; MAX(&&CHAR&N_CHAR) %END;
   into
   %IF &N_NUM > 1 %THEN %DO;
      %do I= 1 %to &N_NUM_1; :NUMMAX&I, %END;
   %END;
   %IF &N_NUM > 0 %THEN %DO; :NUMMAX&N_NUM %END;
   %IF &N_CHAR > 0 AND &N_NUM > 0 %THEN %DO; , %END;
   %IF &N_CHAR > 1 %THEN %DO;
      %do I= 1 %to &N_CHAR_1; :CHARMAX&I, %END;
   %END;
   %IF &N_CHAR > 0 %THEN %DO; :CHARMAX&N_CHAR %END;
   from &DSNIN;
quit;
/*===============================================================================*/
/* initialize DROP_NUM, DROP_CHAR global macro vars
/*===============================================================================*/
%let DROP_NUM  = ;
%let DROP_CHAR = ;
%if &N_CHAR > 0 %THEN %DO;
   %do I = 1 %to &N_CHAR ;
      %IF &&CHARMAX&I = %THEN %DO;
         %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
      %END;
   %end ;
%END;
%if &N_NUM > 0 %THEN %DO;
   %do I = 1 %to &N_NUM ;
      %IF &&NUMMAX&I = . %THEN %DO;
         %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
      %END;
   %End ;
%END;
/*===============================================================================*/
/* apply DROP_* to incoming data, create output dataset                          */
/*===============================================================================*/
data &DSNOUT ;
%if &DROP_CHAR ^= %then %str( DROP &DROP_CHAR ; ) ; /* drop char variables that
have only missing values */
%if &DROP_NUM ^= %then %str( DROP &DROP_NUM ; ) ; /* drop num variables that
have only missing values */
set &DSNIN ;
run ;
%L9999:
%mend DROPMISS ;
APPENDIX C:
%SQ_DROPMISS
OPTIONS MPRINT MLOGIC SYMBOLGEN;
%macro SQ_DROPMISS( DSNIN        /* name of input SAS dataset                        */
                  , DSNOUT       /* name of output SAS dataset                       */
                  , NOCOMPRESS=  /* [optional] variables to be omitted from the
                                    minimum-length computation process               */
                  , NODROP=      /* [optional] variables to be omitted from dropping
                                    even if they have only missing values            */
                  ) ;

/* PURPOSE: Squeeze a data set to have minimum lengths required for the
 *          variables excluding the variables in &NOCOMPRESS applying %SQUEEZE_1 and
 *          then DROP the variables that have always missing values in a more
 *          efficient way.
 *
 * EXAMPLE OF USE:
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS= )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=A B C D--H X1-X100 )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_numeric_ )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_character_ )
 *    %SQ_DROPMISS( DSNIN, DSNOUT, NOCOMPRESS=_character_, NODROP=A C D )
 */
/*###############################################################################*/
/* begin executable code
/*###############################################################################*/
/*===============================================================================*/
/* Squeezing part
/*===============================================================================*/
/*===============================================================================*/
/* Include the code for the macro %SQUEEZE_1 here                                */
/*===============================================================================*/
%SQUEEZE_1( &DSNIN, DSNSQUEEZED, NOCOMPRESS=&NOCOMPRESS ) ;
/*===============================================================================*/
/* Dropping part
/*===============================================================================*/
%global DROP1 ;
%local I ;
%if "&DSNIN" = "&DSNOUT" %then %do ;
   %put /------------------------------------------------\ ;
   %put | ERROR from SQ_DROPMISS:                        | ;
   %put | Input Dataset has same name as Output Dataset. | ;
   %put | Execution terminating forthwith.               | ;
   %put \------------------------------------------------/ ;
   %goto L9999 ;
%end ;
/*===============================================================================*/
/* create dataset of variable names that have only missing values
/* exclude from the computation all names in &NODROP
/*===============================================================================*/
proc contents data=DSNSQUEEZED( drop=&NODROP ) memtype=data noprint out=
_cntnts_( keep= name type length) ; run ;
%let N_CHAR = 0 ;
%let N_NUM = 0 ;
data _null_ ;
set _cntnts_ end=lastobs nobs=nobs ;
where (type =1 and length =3) or (type=2 and length =1);
if nobs = 0 then stop ;
n_char + ( type = 2 ) ;
n_num + ( type = 1 ) ;
/* create macro vars containing final # of char, numeric variables */
if lastobs then do ;
call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
call symput( 'N_NUM' , left( put( n_num , 5. ))) ;
end ;
run ;
/*===============================================================================*/
/* if there are NO numeric or character vars in dataset, stop further */
/*===============================================================================*/
%if %eval( &N_NUM + &N_CHAR ) = 0 %then %do ;
   %put /----------------------------------\ ;
   %put | ERROR from SQ_DROPMISS:          | ;
   %put | No variables in dataset to drop. | ;
   %put | Execution terminating forthwith. | ;
   %put \----------------------------------/ ;
   %goto L9999 ;
%end ;
/*===============================================================================*/
/* put global macro names into global symbol table for later retrieval
*/
/*===============================================================================*/
%do I = 1 %to &N_NUM ;
%global NUM&I ;
%end ;
%do I = 1 %to &N_CHAR ;
%global CHAR&I ;
%end ;
/*===============================================================================*/
/* create macro vars containing variable names
/* efficiency note: could compute n_char, n_num here, but must declare macro
/* names to be global b4 stuffing them
/* note: if no char vars in data, do not create macro vars
/*===============================================================================*/
proc sql noprint ;
%if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from
_cntnts_ where type = 2 ; ) ;
%if &N_NUM > 0 %then %str( select name into :NUM1 - :NUM&N_NUM
from
_cntnts_ where type = 1 ; ) ;
quit ;
/*===============================================================================*/
/* put MAXIMUM values of the variables into macro variables
/*===============================================================================*/
%IF &N_CHAR > 1 %THEN
%let N_CHAR_1 = %EVAL(&N_CHAR - 1);
%IF &N_NUM > 1 %THEN
%let N_NUM_1 = %EVAL(&N_NUM - 1);
Proc sql ;
select
%IF &N_NUM >1 %THEN %DO;
%do I= 1 %to &N_NUM_1; max (&&NUM&I),
%END;
%END;
%IF &N_NUM > 0 %THEN %DO;
MAX(&&NUM&N_NUM)
%END;
%IF &N_CHAR >0 AND &N_NUM
>0 %THEN %DO; ,
%END;
%IF &N_CHAR > 1 %THEN %DO;
%do I= 1 %to &N_CHAR_1; max(&&CHAR&I),
%END;
%END;
%IF &N_CHAR >0 %THEN %DO;
MAX(&&CHAR&N_CHAR)
%END; into %IF &N_NUM > 1 %THEN %DO;
%do I= 1 %to &N_NUM_1; :NUMMAX&I,
%END;
%END;
%IF &N_NUM > 0 %THEN %DO;
:NUMMAX&N_NUM
%END;
%IF &N_CHAR> 0 AND &N_NUM >0 %THEN %DO; ,
%END;
%IF &N_CHAR > 1 %THEN %DO;
%do I= 1 %to &N_CHAR_1; :CHARMAX&I,
%END;
%END;
%IF &N_CHAR > 0 %THEN
%DO;:CHARMAX&N_CHAR
%END;
from &DSNIN;
quit;
/*===============================================================================*/
/* initialize DROP_NUM, DROP_CHAR global macro vars
/*===============================================================================*/
%let DROP_NUM  = ;
%let DROP_CHAR = ;
%if &N_CHAR > 0 %THEN %DO;
   %do I = 1 %to &N_CHAR ;
      %IF &&CHARMAX&I = %THEN %DO;
         %let DROP_CHAR = &DROP_CHAR %qtrim( &&CHAR&I ) ;
      %END;
   %end ;
%END;
%if &N_NUM > 0 %THEN %DO;
   %do I = 1 %to &N_NUM ;
      %IF &&NUMMAX&I = . %THEN %DO;
         %let DROP_NUM = &DROP_NUM %qtrim( &&NUM&I ) ;
      %END;
   %End ;
%END;
/*===============================================================================*/
/* apply Drop_* to incoming data, create output dataset
*/
/*===============================================================================*/
data &DSNOUT ;
%if &DROP_CHAR ^= %then %str( DROP &DROP_CHAR ; ) ; /* drop char variables that
have only missing values */
%if &DROP_NUM ^= %then %str( DROP &DROP_NUM ;) ; /* drop num variables that
have only missing values */
set DSNSQUEEZED ;
run ;
%L9999:
%mend SQ_DROPMISS ;