DATA Step MODIFY and UPDATE Statements:
Do They Preserve Integrity Constraints?
Baskar Anjappan, MUFG Union Bank, N.A., San Francisco, California
ABSTRACT
Programmers frequently receive data from various internal and external sources. It is good practice to validate the
integrity constraints of the data that one receives. Unfortunately, DATA steps that use SET or MERGE do not
preserve constraints. To retain the integrity constraints, one must use an appropriate technique. In this paper a
sample (base) data set is created and integrity constraints (ICs) are defined by using PROC DATASETS. The same
base data set is then manipulated using two different techniques: 1) the MODIFY statement in a SAS® DATA step,
and 2) the UPDATE statement in a SAS DATA step. Which, if any, of these methods will preserve the integrity
constraints?
INTRODUCTION
Variables in a data set are often expected to take one of a set of discrete values or to fall within a range of values.
Consider, for example, variables related to loan applications at a bank. The id variable assigned to each
customer must be unique, and the age of people applying for a loan at a bank ranges approximately from 18 to 85.
Certain financial measures of risk that are calculated as part of the loan application process have natural but
constrained ranges. Integrity constraints provide a way to restrict the values of a variable in a data set.
Here we work with a sample data set containing artificial data. For example, LGD (Loss Given Default) grade has
internal values from A to C, and PD (Probability of Default) risk rating and financial factor values must be in the
range of 1 to 20. In addition, the integrity constraint (IC) val_max applies to the sum of the PD risk rating and
fin_factor1: the total must not exceed 25.
It is well-known that the DATA step SET and MERGE operations do not preserve integrity constraints. The DATA
step MODIFY operation does a change-in-place so intuitively one expects that this operation will preserve integrity
constraints. The DATA step UPDATE operation can read and write data in a single step, and it is unclear whether it
will preserve integrity constraints. In this paper we transform or update sample base data sets using both DATA step
MODIFY and UPDATE statements and observe the changes, to determine whether integrity constraints are
preserved or not.
BACKGROUND: INTEGRITY CONSTRAINTS
Integrity constraints are a set of validation rules that can be specified to restrict the data values accepted into a SAS
data file. This example creates several types of integrity constraint: primary key, not null, and check. For
validation purposes, I attempt to change data in ways that violate the check integrity constraints, using the MODIFY
and UPDATE statements.
Use MODIFY statement to update the data set:
Input data: A simple sample SAS® data set, with the integrity constraints constructed on top of it.
Figure 1: Sample pd_lgd_rating_sample data set
Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
  1      1   20   M       A                10            5
  2      2   36   F       B                10           10
  3      3   42   M       C                10            5
  4      4   18   F       A                 5           10
  5      5   84   F       C                19            1
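The paper does not show the step that builds this sample data set (the author's full program is in the uploaded source file cited near the end of the paper). A minimal sketch that would produce the rows in Figure 1 might look like the following. The variable lengths are taken from the PROC CONTENTS listing in Figure 2; the use of DATALINES is an assumption made for illustration.

```sas
/* Sketch only: builds the five rows shown in Figure 1.              */
/* Lengths follow the PROC CONTENTS listing; DATALINES is assumed.   */
data pd_lgd_rating_sample;
   length idnum 4 age 4 sex $1 lgd_grade $1 pd_riskrating 8 fin_factor1 8;
   input idnum age sex $ lgd_grade $ pd_riskrating fin_factor1;
   datalines;
1 20 M A 10 5
2 36 F B 10 10
3 42 M C 10 5
4 18 F A 5 10
5 84 F C 19 1
;
run;
```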
Source code to define and apply the integrity constraints:
proc datasets nolist;
   modify pd_lgd_rating_sample;
   /* Primary key on the customer id */
   ic create PKIDInfo = primary key(idnum)
      message = 'idnum values are unique';
   /* Check constraint: sex is restricted to M or F */
   ic create val_sex = check(where=(sex in ('M','F')))
      message = "Valid values for variable SEX are either 'M' or 'F'.";
   /* Not-null constraint on age */
   ic create val_age = not null(age)
      message = "AGE cannot be missing";
   /* Check constraints on the rating, factor, and grade values */
   ic create val_pd    = check(where=(pd_riskrating <= 20));
   ic create val_fin   = check(where=(fin_factor1 <= 10));
   ic create val_grade = check(where=(lgd_grade in('A','B','C')));
   /* Check constraint on the sum of two variables */
   ic create val_max   = check(where=((pd_riskrating+fin_factor1) <= 25));
quit;
Next, we use PROC CONTENTS with the OUT2= option to verify that the integrity constraints have been registered in
the file metadata.
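The verification call itself is not listed in the paper. A minimal sketch of such a call, assuming the data set is in the default library and using ic_info as a hypothetical name for the OUT2= output data set, could be:

```sas
/* Sketch: ic_info is a hypothetical name for the OUT2= data set. */
proc contents data=pd_lgd_rating_sample out2=ic_info;
run;

/* List the constraint information that OUT2= wrote. */
proc print data=ic_info;
run;
```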
Figure 2: Contents of pd_lgd_rating_sample data set
Alphabetic List of Variables and Attributes

#  Variable       Type  Len
2  age            Num     4
6  fin_factor1    Num     8
1  idnum          Num     4
4  lgd_grade      Char    1
5  pd_riskrating  Num     8
3  sex            Char    1

Alphabetic List of Integrity Constraints

#  Integrity Constraint  Type         Variables  Where Clause                     User Message
1  PKIDInfo              Primary Key  idnum                                       idnum values are unique
2  val_age               Not Null     age                                         AGE cannot be missing
3  val_fin               Check                   fin_factor1<=10
4  val_grade             Check                   lgd_grade in ('A', 'B', 'C')
5  val_max               Check                   (pd_riskrating+fin_factor1)<=25
6  val_pd                Check                   pd_riskrating<=20
7  val_sex               Check                   sex in ('F', 'M')                Valid values for variable SEX are either 'M' or 'F'.
Source code to test MODIFY operations with the data set:
data pd_lgd_rating_sample;
   modify pd_lgd_rating_sample;
   if idnum = 1 then sex = "G";
   if idnum = 1 then lgd_grade = "Z";
   if idnum = 2 then sex = "M";
   if idnum = 5 then fin_factor1 = 7;
run;
SAS Log for code:
data pd_lgd_rating_sample;
   modify pd_lgd_rating_sample;
   if idnum = 1 then sex = "G";
   if idnum = 1 then lgd_grade = "Z";
   if idnum = 2 then sex = "M";
   if idnum = 5 then fin_factor1 = 7;
run;

idnum=1 age=20 sex=G lgd_grade=Z pd_riskrating=10 fin_factor1=5 _ERROR_=1
_IORC_=660130 _N_=1
idnum=5 age=84 sex=F lgd_grade=C pd_riskrating=19 fin_factor1=7 _ERROR_=1
_IORC_=660130 _N_=5
NOTE: There were 5 observations read from the data set
WORK.PD_LGD_RATING_SAMPLE.
NOTE: The data set WORK.PD_LGD_RATING_SAMPLE has been updated. There were 3
observations rewritten, 0 observations added and 0
observations deleted.
NOTE: There were 2 rejected updates, 0 rejected adds, and 0 rejected deletes.
NOTE: DATA statement used (Total process time):
The log shows that 3 observations were rewritten, that is, written back with the values assigned in the DATA step, and
that 2 updates were rejected because they violate the integrity constraints defined on the data set. The value of
sex must be either M or F, and the value of lgd_grade must be A, B, or C; it cannot be Z. For the idnum = 1 row, the
attempt to set sex = "G" and lgd_grade = "Z" was therefore rejected. For the idnum = 5 row, pd_riskrating = 19 and
fin_factor1 = 1; the IC val_max requires (pd_riskrating+fin_factor1) to be less than or equal to 25. Changing
fin_factor1 from 1 to 7 makes the total 19+7 = 26, which is out of range, hence this update is rejected.
The only actual change made by this MODIFY DATA step was to the idnum = 2 row, where sex was changed
from F to M. This change is within the range of the integrity constraints. The other two rewritten observations
(idnum = 3 and idnum = 4) were written back unchanged, so the output data set has only one actual change and the
remaining assignments had no effect.
Figure 3: Output results for pd_lgd_rating_sample data set.
Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
  1      1   20   M       A                10            5
  2      2   36   M       B                10           10
  3      3   42   M       C                10            5
  4      4   18   F       A                 5           10
  5      5   84   F       C                19            1
The integrity constraints were verified using PROC CONTENTS OUT2= option; this showed that the original integrity
constraints were preserved.
Use UPDATE statement to change the data set
Second sample data set: A second sample data set was created with the same data structure. This data set is used
in a DATA step with the UPDATE statement to test if integrity constraints are preserved when using UPDATE.
Figure 4: Second input pd_lgd_rating_sample2 data set.
Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
  1      1   20   G       Z                10            5
  2      2   36   M       B                10           10
  3      3   42   M       C                10            5
  4      4   18   F       A                 5           10
  5      5   84   G       C                19            7
The objective of this exercise is to update the values in the original data set using data from a second sample data
set. Some of the values in the second data set are out of range; specifically, they violate the integrity constraints
of the original data set. The sample updates are similar to those in the previous example: valid values of the
variable sex are M and F, and anything else is invalid. For the idnum = 5 row, pd_riskrating = 19 and
fin_factor1 = 1; the IC val_max requires (pd_riskrating+fin_factor1) to be less than or equal to 25. Changing
fin_factor1 from 1 to 7 in the update makes the total 19+7 = 26, which is out of range, but the UPDATE statement
will still overwrite the fin_factor1 value from 1 to 7.
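To see ahead of time which rows of the second data set conflict with the check constraints, the rules can be re-expressed as ordinary DATA step conditions. The following is a sketch only, not part of the author's program: ic_violations and rule are hypothetical names, and the conditions simply negate the WHERE clauses listed in Figure 2. The primary key and not-null constraints are not covered.

```sas
/* Sketch: flag rows that would fail the check constraints.        */
/* ic_violations and rule are hypothetical names.                   */
data ic_violations;
   set pd_lgd_rating_sample2;
   length rule $9;
   /* Each test negates one check constraint; a row can fail several. */
   if not (sex in ('M','F'))                  then do; rule = 'val_sex';   output; end;
   if not (lgd_grade in ('A','B','C'))        then do; rule = 'val_grade'; output; end;
   if not (pd_riskrating <= 20)               then do; rule = 'val_pd';    output; end;
   if not (fin_factor1 <= 10)                 then do; rule = 'val_fin';   output; end;
   if not ((pd_riskrating+fin_factor1) <= 25) then do; rule = 'val_max';   output; end;
run;

proc print data=ic_violations;
   var idnum rule sex lgd_grade pd_riskrating fin_factor1;
run;
```

With the values shown in Figure 4, the idnum = 1 row fails val_sex and val_grade, and the idnum = 5 row fails val_sex and val_max.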
Source code for test of UPDATE statement:
data pd_lgd_rating_sample;
   update pd_lgd_rating_sample pd_lgd_rating_sample2;
   by idnum;
run;
Figure 5: Output of pd_lgd_rating_sample data set
Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
  1      1   20   G       Z                10            5
  2      2   36   M       B                10           10
  3      3   42   M       C                10            5
  4      4   18   F       A                 5           10
  5      5   84   G       C                19            7
Next, PROC CONTENTS with the OUT2= option is used to check whether the integrity constraints are still registered in
the file metadata. The OUT2= file is empty: it has no rows and no variables, hence no integrity constraints. This shows
that the ICs were not preserved by UPDATE.
Figure 6: Contents of pd_lgd_rating_sample data set after update.
Alphabetic List of Variables and Attributes

#  Variable       Type  Len
2  age            Num     4
6  fin_factor1    Num     8
1  idnum          Num     4
4  lgd_grade      Char    1
5  pd_riskrating  Num     8
3  sex            Char    1
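For comparison, the same second data set could in principle be applied as a transaction data set with the MODIFY statement and a BY statement, which changes the base data set in place rather than rebuilding it. This form was not tested in the paper, so the sketch below is an assumption about how one might set it up, not a reported result. It presumes that pd_lgd_rating_sample is the original base data set with its integrity constraints still defined, that is, before the UPDATE step above has been run.

```sas
/* Sketch: apply the second data set as transactions, in place.       */
/* Assumes the base data set still carries its integrity constraints  */
/* (run instead of, not after, the UPDATE step).                      */
data pd_lgd_rating_sample;
   modify pd_lgd_rating_sample pd_lgd_rating_sample2;
   by idnum;
run;
```

Because the base data set is modified in place, one would expect updates that violate the constraints to be rejected, as in the first MODIFY example; checking the log and rerunning PROC CONTENTS with OUT2= would confirm this.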
Source code for testing: The source code for this paper has been uploaded to sascommunity.org:
http://www.sascommunity.org/mwiki/images/b/b2/Modify_update_WUSS_paper_2014.sas
CONCLUSION:
It is always good practice to define integrity constraints on source data upstream, because the same data are later
merged or joined into various data sources across the organization. Having legitimate integrity constraints on the
upstream data source helps prevent incorrect manipulation of the data downstream. Using an appropriate
technique to define and retain the integrity constraints is important. Two different techniques were compared in this
paper. The MODIFY statement preserves the integrity constraints, but the UPDATE statement does not.
With the UPDATE statement, not only are the integrity constraints lost, but the base data set is also overwritten with
values from the other data set.
References
SAS(R) 9.2 Language Reference: Concepts, Second Edition
https://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000403555.htm
SAS(R) 9.2 SQL Procedure User's Guide
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001396785.htm
Acknowledgment:
Thanks to Thomas E. Billings (of MUFG Union Bank, N.A.) for valuable comments and suggestions. Any errors
herein are solely the responsibility of the author.
Contact Information:
Baskar Anjappan
MUFG Union Bank, N.A.
Basel II - Retail Credit BTMU
350 California St.; 9th floor
MC H-925
San Francisco, CA 94104
Phone: 415-273-2472
Email: baskarworld@hotmail.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.