DATA Step MODIFY and UPDATE Statements: Do They Preserve Integrity Constraints?

Baskar Anjappan, MUFG Union Bank, N.A., San Francisco, California

ABSTRACT

Programmers frequently receive data from various internal and external sources. It is good practice to validate the integrity constraints of the data that one receives. Unfortunately, the DATA step with SET and MERGE does not preserve constraints. To retain the integrity constraints, one must use the appropriate technique. In this paper a sample (base) data set is created and integrity constraints (ICs) are defined using PROC DATASETS. The same base data set is then used to manipulate the data using two different techniques: 1) the MODIFY statement in a SAS® DATA step, 2) the UPDATE statement in a SAS® DATA step. Which, if any, of these methods will preserve the integrity constraints?

INTRODUCTION

It is often the case that variables in a data set are expected to have a set of discrete values or a range of values. Consider, for example, variables related to loan applications at a bank. The ID variable assigned to each customer must be unique, and the age of people applying for a loan at a bank ranges approximately from 18 to 85. Certain financial measures of risk that are calculated as part of the loan application process have natural but constrained ranges. Integrity constraints provide a way to restrict the values of a variable in a data set.

Here we work with a sample data set of artificial data. For example, the LGD (Loss Given Default) grade takes values from A to C, and the PD (Probability of Default) risk rating and financial factor values must be in the range of 1 to 20 (the check constraints defined in the source code cap pd_riskrating at 20 and fin_factor1 at 10). Also, the integrity constraint val_max applies to the sum of the PD risk rating and fin_factor1: the total must not exceed 25.

It is well known that the DATA step SET and MERGE operations do not preserve integrity constraints.
The DATA step MODIFY operation makes changes in place, so intuitively one expects that this operation will preserve integrity constraints. The DATA step UPDATE operation can read and write data in a single step, and it is unclear whether it will preserve integrity constraints. In this paper we transform or update sample base data sets using both the DATA step MODIFY and UPDATE statements and observe the changes to determine whether integrity constraints are preserved.

BACKGROUND: INTEGRITY CONSTRAINTS

Integrity constraints are a set of validation rules that can be specified to restrict the data values accepted into a SAS data file. This example creates several types of integrity constraints: primary key, not null, and check. For validation purposes, I attempt to change data covered by the check integrity constraints using the MODIFY and UPDATE statements.

Use MODIFY statement to update the data set:

Input data: A simple sample SAS® data set, with the integrity constraints constructed on top of it.

Figure 1: Sample pd_lgd_rating_sample data set

    Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
      1      1   20   M       A                 10            5
      2      2   36   F       B                 10           10
      3      3   42   M       C                 10            5
      4      4   18   F       A                  5           10
      5      5   84   F       C                 19            1

Source code to define and apply the integrity constraints:

    proc datasets nolist;
      modify pd_lgd_rating_sample;
        ic create PKIDInfo = primary key(idnum)
           message = 'idnum values are unique';
        ic create val_sex = check(where=(sex in ('M','F')))
           message = "Valid values for variable SEX are either 'M' or 'F'.";
        ic create val_age = not null(age)
           message = "AGE cannot be missing";
        ic create val_pd = check(where=(pd_riskrating <= 20));
        ic create val_fin = check(where=(fin_factor1 <= 10));
        ic create val_grade = check(where=(lgd_grade in('A','B','C')));
        ic create val_max = check(where=((pd_riskrating+fin_factor1) <= 25));
    quit;

Next, we use PROC CONTENTS with the OUT2= option to verify that the integrity constraints have been registered in the file metadata.
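The verification step just described can be sketched as follows. This is a minimal illustration rather than the author's exact code; the output data set name ic_info is a hypothetical name chosen for this example.

```sas
/* Print the data set contents and also write index and integrity   */
/* constraint metadata to WORK.ic_info (hypothetical data set name). */
proc contents data=pd_lgd_rating_sample out2=ic_info;
run;

/* List the captured metadata. If no constraints are registered,    */
/* the OUT2= data set is expected to contain no rows.               */
proc print data=ic_info;
run;
```

Running the same two steps again after each DATA step test gives a before-and-after comparison of the constraint metadata.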
Figure 2: Contents of pd_lgd_rating_sample data set

Alphabetic List of Variables and Attributes

    #  Variable       Type  Len
    2  age            Num     4
    6  fin_factor1    Num     8
    1  idnum          Num     4
    4  lgd_grade      Char    1
    5  pd_riskrating  Num     8
    3  sex            Char    1

Alphabetic List of Integrity Constraints

    #  Integrity Constraint  Type         Variables  Where Clause                       User Message
    1  PKIDInfo              Primary Key  idnum                                         idnum values are unique
    2  val_age               Not Null     age                                           AGE cannot be missing
    3  val_fin               Check                   fin_factor1<=10
    4  val_grade             Check                   lgd_grade in ('A', 'B', 'C')
    5  val_max               Check                   (pd_riskrating+fin_factor1)<=25
    6  val_pd                Check                   pd_riskrating<=20
    7  val_sex               Check                   sex in ('F', 'M')                  Valid values for variable SEX are either 'M' or 'F'.

Source code to test MODIFY operations with the data set:

    data pd_lgd_rating_sample;
      modify pd_lgd_rating_sample;
      if idnum = 1 then sex = "G";
      if idnum = 1 then lgd_grade = "Z";
      if idnum = 2 then sex = "M";
      if idnum = 5 then fin_factor1 = 7;
    run;

SAS Log for code:

    data pd_lgd_rating_sample;
      modify pd_lgd_rating_sample;
      if idnum = 1 then sex = "G";
      if idnum = 1 then lgd_grade = "Z";
      if idnum = 2 then sex = "M";
      if idnum = 5 then fin_factor1 = 7;
    run;

    idnum=1 age=20 sex=G lgd_grade=Z pd_riskrating=10 fin_factor1=5 _ERROR_=1 _IORC_=660130 _N_=1
    idnum=5 age=84 sex=F lgd_grade=C pd_riskrating=19 fin_factor1=7 _ERROR_=1 _IORC_=660130 _N_=5
    NOTE: There were 5 observations read from the data set WORK.PD_LGD_RATING_SAMPLE.
    NOTE: The data set WORK.PD_LGD_RATING_SAMPLE has been updated. There were 3 observations rewritten, 0 observations added and 0 observations deleted.
    NOTE: There were 2 rejected updates, 0 rejected adds, and 0 rejected deletes.
    NOTE: DATA statement used (Total process time):

The log shows that 3 observations were rewritten, meaning they were written back to the data set by the DATA step, and 2 updates were rejected because they violated the integrity constraints set up on the data set.
The value of the variable sex must be either M or F, and the value of lgd_grade must be A, B or C; it cannot be Z. For the idnum = 1 row, the update that set sex = "G" and lgd_grade = "Z" was therefore rejected. For the idnum = 5 row, pd_riskrating = 19 and fin_factor1 = 1. Under the IC val_max, (pd_riskrating+fin_factor1) must be less than or equal to 25. Changing fin_factor1 from 1 to 7 makes the total 19 + 7 = 26, which is out of range, so this update was rejected.

The only change that took effect in this MODIFY DATA step was for idnum = 2, where sex was changed from F to M. This change is within the range allowed by the integrity constraints. The output data set has only one actual change; the remaining attempted changes had no impact.

Figure 3: Output results for pd_lgd_rating_sample data set.

    Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
      1      1   20   M       A                 10            5
      2      2   36   M       B                 10           10
      3      3   42   M       C                 10            5
      4      4   18   F       A                  5           10
      5      5   84   F       C                 19            1

The integrity constraints were verified using the PROC CONTENTS OUT2= option; this showed that the original integrity constraints were preserved.

Use UPDATE statement to change the data set

Second sample data set: A second sample data set was created with the same data structure. This data set is used in a DATA step with the UPDATE statement to test whether integrity constraints are preserved when using UPDATE.

Figure 4: Second input pd_lgd_rating_sample2 data set.

    Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
      1      1   20   G       Z                 10            5
      2      2   36   M       B                 10           10
      3      3   42   M       C                 10            5
      4      4   18   F       A                  5           10
      5      5   84   G       C                 19            7

The objective of this exercise is to update the values in the original data set using data from the second sample data set. Some of the values in the second data set are out of range; specifically, they violate the integrity constraints of the original data set. The sample updates are similar to those in the previous example: the valid values of the variable sex are M and F, and anything else is invalid.
For the idnum = 5 row, pd_riskrating = 19 and fin_factor1 = 1. Under the IC val_max, (pd_riskrating+fin_factor1) must be less than or equal to 25. Changing fin_factor1 from 1 to 7 in the update makes the total 19 + 7 = 26, which is out of range, yet the UPDATE statement still overwrites fin_factor1 from 1 to 7.

Source code for test of UPDATE statement:

    data pd_lgd_rating_sample;
      update pd_lgd_rating_sample pd_lgd_rating_sample2;
      by idnum;
    run;

Figure 5: Output of pd_lgd_rating_sample data set

    Obs  idnum  age  sex  lgd_grade  pd_riskrating  fin_factor1
      1      1   20   G       Z                 10            5
      2      2   36   M       B                 10           10
      3      3   42   M       C                 10            5
      4      4   18   F       A                  5           10
      5      5   84   G       C                 19            7

Next, use PROC CONTENTS with the OUT2= option to check whether the integrity constraints are still registered in the file metadata. The OUT2= file is empty: it has no rows and no variables, hence no integrity constraints. This shows that the ICs were not preserved by UPDATE.

Figure 6: Contents of pd_lgd_rating_sample data set after update.

Alphabetic List of Variables and Attributes

    #  Variable       Type  Len
    2  age            Num     4
    6  fin_factor1    Num     8
    1  idnum          Num     4
    4  lgd_grade      Char    1
    5  pd_riskrating  Num     8
    3  sex            Char    1

Source code for testing: The source code for the tasks in this paper has been uploaded to sascommunity.org:
http://www.sascommunity.org/mwiki/images/b/b2/Modify_update_WUSS_paper_2014.sas

CONCLUSION:

It is always good practice to define integrity constraints on the source data upstream, because the same data is later merged or joined into various data sources across the organization. Having legitimate integrity constraints on the upstream data source helps prevent incorrect manipulation of the data downstream. Using the appropriate technique to retain the integrity constraints is important. Two different techniques were used in this paper to observe the differences. The MODIFY statement preserves the integrity constraints, but the UPDATE statement does not.
With the UPDATE statement, not only are the integrity constraints lost, but the base data set is also overwritten with values from the other data set.

References

SAS(R) 9.2 Language Reference: Concepts, Second Edition
https://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000403555.htm

SAS(R) 9.2 SQL Procedure User's Guide
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001396785.htm

Acknowledgment:

Thanks to Thomas E. Billings (of MUFG Union Bank, N.A.) for valuable comments and suggestions. Any errors herein are solely the responsibility of the author.

Contact Information:

Baskar Anjappan
MUFG Union Bank, N.A.
Basel II - Retail Credit BTMU
350 California St.; 9th floor MC H-925
San Francisco, CA 94104
Phone: 415-273-2472
Email: baskarworld@hotmail.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.